CN113870901A - Voice emotion recognition method based on SVM-KNN - Google Patents
Info
- Publication number: CN113870901A (application CN202111127502.7A)
- Authority: CN (China)
- Prior art keywords: emotion, sample, svm, training, samples
- Prior art date: 2021-09-26
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
Abstract
A speech emotion recognition method based on SVM-KNN comprises the following steps: step one, the original speech signal is preprocessed; step two, speech enhancement is performed with a microphone-array delay-alignment method; step three, features are extracted from the processed data with a BN-DNN based on the SHL structure; step four, the extracted features are selected with a method based on fuzzy set theory; and step five, emotion recognition is carried out with an optimized SVM-KNN method. The method yields higher speech emotion classification accuracy, avoids the optimization bottleneck of the SVM on large-scale training samples, and improves both SVM classification accuracy and recognition speed. In addition, the SVM-KNN idea proposed by the invention can be applied to other speech-recognition tasks, such as dialect classification, and provides a reference for classification and recognition based on speech signals.
Description
Technical Field
The invention relates to voice emotion recognition, in particular to a voice emotion recognition method based on SVM-KNN.
Background
Among current speech emotion recognition methods, the support vector machine (SVM) has proved to be a relatively effective classification tool, but when the degree of emotion confusion is large the SVM still struggles to recognize emotions accurately.
Emotion has long been studied by experts in physiology and psychology. With the rapid development of artificial intelligence, emotion research in human-computer interaction has attracted great interest. In human-computer interaction, people hope to communicate with machines more naturally, which requires machines to understand human emotions, so emotion classification and recognition are particularly important. In human communication, speech carries rich information, so machines can use it to classify and recognize emotions. Experts have done extensive research and analysis on speech emotion classification and recognition, including building speech emotion databases, extracting emotional features, and designing classification and recognition methods. To improve the speech emotion recognition rate, previous work has refined each of these links, but there is no unified system and the recognition rate is still not very high. MFCC has been used as a recognition feature, but without further processing before recognition it carries a large amount of redundant information that degrades recognition. To eliminate this influence and improve the recognition rate, the choice of a suitable classifier is a key research point: selecting a proper classification method is essential to improving the emotion recognition rate and handling the emotional features correctly.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a speech emotion recognition method based on SVM-KNN, which performs speech enhancement with a microphone-array delay-alignment method, extracts features with a BN-DNN based on the SHL structure, selects features with a method based on fuzzy set theory, and then performs emotion recognition with an optimized SVM-KNN method, yielding a speech emotion recognition method with high accuracy and low computational load.
In order to achieve the purpose, the invention adopts the technical scheme that:
A speech emotion recognition method based on SVM-KNN, combining dedicated speech signal preprocessing, feature extraction, fuzzy feature selection and SVM-KNN (support vector machine plus K-nearest-neighbour) classification, comprises the following steps:
(1) An input speech signal is preprocessed; the preprocessing comprises pre-emphasis filtering and windowed framing, where the pre-emphasis coefficient α of the pre-emphasis filter is 0.95 and the frame length of the windowed framing is 26 ms (a preprocessing sketch follows this step);
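A minimal sketch of this preprocessing step is given below. The sampling rate, frame shift and Hamming window are illustrative assumptions; the patent only fixes α = 0.95 and a 26 ms frame length.

```python
import numpy as np

def preprocess(signal, fs=16000, alpha=0.95, frame_ms=26, shift_ms=10):
    """Pre-emphasis plus windowed framing, as described in step (1).

    fs, shift_ms and the Hamming window are assumptions for illustration.
    """
    # Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)    # 26 ms -> 416 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)

    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, frame_shift):
        frames.append(emphasized[start:start + frame_len] * window)
    return np.array(frames)                  # shape: (num_frames, frame_len)
```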
(2) the data of different microphone channels are delayed and aligned by using a microphone array solution so as to realize the positioning of a sound source and improve the audio quality:
1) A nested microphone array of 9 microphones is used; it is in effect 4 linear sub-arrays, each of 5 uniformly spaced microphones (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm respectively), ensuring that the 300-3400 Hz frequency range of the recorded speech signal is covered.
2) At the same time, the ratio between the microphone spacing and the distance from the speaker to the microphone array is modeled as satisfying the far-field assumption for the room impulse response (a delay-alignment sketch follows this step).
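The delay alignment of step (2) can be sketched as time-delay estimation by cross-correlation followed by shifting each channel onto a reference channel. The patent does not specify the estimator, so the cross-correlation approach and the function name below are assumptions.

```python
import numpy as np

def align_channels(channels, ref_idx=0):
    """Delay-align multi-microphone recordings to a reference channel.

    channels: list of equal-length 1-D signals, one per microphone.
    The cross-correlation delay estimator is an illustrative choice.
    """
    ref = channels[ref_idx]
    aligned = []
    for ch in channels:
        # Estimate the lag of this channel relative to the reference.
        corr = np.correlate(ch, ref, mode="full")
        delay = np.argmax(corr) - (len(ref) - 1)   # positive: ch lags ref
        # Shift the channel back onto the reference (np.roll wraps around,
        # which is acceptable for a short edge region in this sketch).
        aligned.append(np.roll(ch, -delay))
    return np.vstack(aligned)

# The aligned channels can then be summed (delay-and-sum beamforming)
# to enhance the speech before feature extraction.
```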
(3) Features are extracted with the BN-DNN based on the SHL structure; the feature extraction process is as follows:
1) In the experiment the BN-DNN model has 5 hidden layers, with the 3rd hidden layer set as the bottleneck layer and 1024 neurons in each of the remaining hidden layers; the input is the 40-dimensional MFCC features of 11 consecutive frames;
2) The input layer therefore has 440 neurons (40 × 11), and the DNN structure is laid out accordingly, with the bottleneck layer in the middle of the five hidden layers.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental candidates are 64, 128 and 256 neurons per group, with overlap factors α of 0%, 20%, 30% and 40%.
4) The sparsity of the network is measured by the proportion of neurons whose activation probability h equals 0, and the sparsity of a layer is defined on that basis, where D is the number of neurons in the layer and h_i (i = 1, 2, …, D) is the activation probability of the i-th neuron; a larger sparsity value indicates a sparser hidden layer. For each model, the model is first trained on the training set to obtain the activation probability of every neuron in each layer; these probabilities are substituted into the definition to compute each layer's sparsity; the average over all hidden layers is taken as the sparsity of the whole neural network; finally the speech bottleneck features are extracted (a sparsity sketch follows this step).
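A minimal numpy sketch of this sparsity measure: the fraction of neurons in a hidden layer whose activation probability is (numerically) zero, averaged over all hidden layers. The small tolerance `eps` is an implementation assumption, since the patent only speaks of probabilities "equal to 0".

```python
import numpy as np

def layer_sparsity(h, eps=1e-6):
    """Fraction of neurons whose activation probability h_i is (near) zero.

    h: 1-D array of length D with the activation probability of each neuron
    in one hidden layer.  eps is an illustrative tolerance.
    """
    return float(np.mean(np.abs(h) < eps))

def network_sparsity(hidden_layer_probs):
    """Average the per-layer sparsity over all hidden layers, as in step (3)."""
    return float(np.mean([layer_sparsity(h) for h in hidden_layer_probs]))
```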
(4) Features are selected with a method based on fuzzy set theory:
1) In the feature space R, for a c-class problem, the training sample set is X = {x1, x2, …, xN}, where N is the number of samples; for a sample x to be classified, the number K of nearest neighbours is determined first;
2) The distance between the sample to be classified and every training sample is computed; the Euclidean distance is used.
3) The N distances are sorted:
d(1) ≤ d(2) ≤ d(3) ≤ … ≤ d(K) ≤ d(K+1) ≤ … ≤ d(N)
where d(1), …, d(K) are the distances from the sample to its K nearest neighbours.
4) The class membership of the sample to be classified is computed according to formula (1), where m is the fuzzy weight adjustment factor and n = 1, 2, …, c; if u_i(x) = max{u_n(x)}, x is judged to belong to the i-th class. The algorithm is repeated until all samples to be classified have been processed (a fuzzy-KNN sketch follows this step).
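The fuzzy-KNN selection of step (4) can be sketched as follows. The membership formula used here is the standard fuzzy K-nearest-neighbour rule (distance weights with exponent 2/(m-1)); since the patent's formula (1) is not reproduced in the text, treating it as this standard rule is an assumption.

```python
import numpy as np

def fuzzy_knn_membership(x, X_train, U_train, K=5, m=2.0):
    """Class memberships u_i(x) of a test sample via fuzzy KNN.

    X_train: (N, d) training samples; U_train: (N, c) memberships of the
    training samples in the c classes; m is the fuzzy weight factor.
    """
    d = np.linalg.norm(X_train - x, axis=1)           # Euclidean distances
    nn = np.argsort(d)[:K]                            # K nearest neighbours
    w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + 1e-12)    # distance weights
    u = (U_train[nn] * w[:, None]).sum(axis=0) / w.sum()
    return u                                          # argmax(u) gives the class
```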
(5) Emotion recognition is carried out with the optimized SVM-KNN method:
1) Let the membership of each sample to its class be s_i; the fuzzified input sample set is then S = {(x1, y1, s1), (x2, y2, s2), …, (xl, yl, sl)}, where xi ∈ R, yi ∈ {1, -1}, σ ≤ si ≤ 1, σ is a sufficiently small positive number, and si indicates the degree to which the i-th sample belongs to the positive class.
2) In the nonlinear case a transformation Φ: R → F is introduced, mapping the samples from the input space R to a high-dimensional feature space F; using the structural-risk-minimization principle and the idea of maximizing the classification margin, the optimal separating hyperplane is determined in the feature space, so that the FSVM optimal-hyperplane problem is converted into a constrained optimization problem with slack variables ξi ≥ 0, i = 1, …, l (its standard form is reproduced after step 4 below).
3) A Lagrange function is established, where μi > 0 and βi > 0 are the Lagrange multipliers, C0 > 0 is the penalty factor, and w is the weight vector of the linear classification function.
4) This yields the dual programming problem with the box constraint
0 ≤ μi ≤ si·C0, i = 1, …, l.   (5)
where k(xi, xj) is the kernel function. Considering the KKT conditions, samples with μi = 0 are correctly classified and are not support vectors, while samples whose multiplier lies strictly between 0 and si·C0 correspond to support vectors on the margin boundary, i.e. the sample xi is correctly classified and sits exactly on the boundary of the margin.
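For reference, the fuzzy-SVM problem described in steps 2)-4) can be written out in its standard form. The objective functions appear only as images in the original, so this reproduction of the usual FSVM formulation is an assumption consistent with the constraints quoted above.

```latex
% Standard fuzzy-SVM (FSVM) primal problem
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C_{0}\sum_{i=1}^{l} s_{i}\,\xi_{i}
\quad \text{s.t.}\quad y_{i}\bigl(w\cdot\Phi(x_{i}) + b\bigr) \ge 1 - \xi_{i},\qquad
\xi_{i} \ge 0,\ i = 1,\dots,l.

% Corresponding dual problem
\max_{\mu}\ \sum_{i=1}^{l}\mu_{i}
 - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\mu_{i}\mu_{j}\,y_{i}y_{j}\,k(x_{i},x_{j})
\quad \text{s.t.}\quad \sum_{i=1}^{l} y_{i}\mu_{i} = 0,\qquad
0 \le \mu_{i} \le s_{i}C_{0},\ i = 1,\dots,l.
```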
Drawings
FIG. 1 is a flow chart of speech signal enhancement according to the present invention;
FIG. 2 is a flow chart of the multi-level SVM classifier of the present invention;
FIG. 3 shows the SVM-KNN classification steps of the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1, 2 and 3, a speech emotion recognition method based on SVM-KNN, combining dedicated speech signal preprocessing, multi-feature extraction, fuzzy-KNN feature selection and SVM (support vector machine) plus K-nearest-neighbour classification, comprises the following steps:
(1) The original data are preprocessed; the preprocessing comprises pre-emphasis, framing, windowing and endpoint detection (an endpoint-detection sketch follows this step).
1) After the signal s(n) is windowed it becomes sw(n), according to the formula sw(n) = s(n) × w(n);
2) The data are filtered by a pre-emphasis filter with pre-emphasis coefficient α = 0.95.
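Step (1) also calls for endpoint detection. One common realization, sketched below, marks the start and end of speech with a short-time energy threshold; the threshold and the single-threshold scheme are assumptions, since the patent only states that endpoints are detected.

```python
import numpy as np

def detect_endpoints(frames, energy_ratio=0.1):
    """Return (start_frame, end_frame) of the speech region.

    frames: (num_frames, frame_len) windowed frames from preprocessing.
    A simple energy threshold (a fraction of the peak energy) is used for
    illustration; the patent does not specify the detector.
    """
    energy = np.sum(frames ** 2, axis=1)          # short-time energy per frame
    threshold = energy_ratio * energy.max()
    voiced = np.where(energy > threshold)[0]
    if voiced.size == 0:
        return 0, frames.shape[0] - 1             # no speech found: keep all
    return int(voiced[0]), int(voiced[-1])
```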
(2) A corresponding microphone array is designed, and the data of the different microphone channels are delay-aligned using a microphone-array solution to localize the sound source and improve the audio quality.
1) A nested microphone array of 9 microphones is used, formed by 4 linear sub-arrays of 5 uniformly spaced microphones each (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm respectively), ensuring that the 300-3400 Hz frequency range of the recorded speech signal is covered.
2) The ratio between the microphone spacing and the distance from the speaker to the array is modeled as satisfying the far-field assumption for the room impulse response.
(3) Features are extracted from the processed data with the BN-DNN based on the SHL structure.
1) The BN-DNN model has 5 hidden layers, with the 3rd hidden layer set as the bottleneck layer and 1024 neurons in each hidden layer; the 40-dimensional MFCC features of 11 consecutive frames are used as input, from which the bottleneck features are derived;
2) The input layer therefore has 440 neurons (40 × 11), and the DNN structure follows the layout described above.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental candidates are 64, 128 and 256 neurons per group, with overlap factors α of 0%, 20%, 30% and 40%.
4) The larger the proportion of neurons whose activation probability is 0, the sparser the network; the sparsity is defined as in the disclosure above. Finally, the average sparsity of all hidden layers is taken as the sparsity of the whole neural network, and the speech bottleneck features are extracted (a model-definition sketch follows this step).
(4) The extracted features are selected with a method based on fuzzy set theory:
1) Short-time energy, short-time amplitude, short-time zero-crossing rate and pitch frequency features are extracted with dedicated functions.
2) The extracted feature parameters are combined into a feature vector that serves as the input of the fuzzy set.
3) For C-class emotion recognition, the mean of each feature parameter under each of the C emotion states is computed over the training sample set X and recorded as Mij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); then every feature parameter of each speech sample under each emotion state (with n indexing the samples of that emotion state, n = 1 being the first sentence, and so on) is normalized according to the normalization formula.
4) The dispersion of each feature parameter under a given emotion is then computed.
5) After the dispersion of each feature parameter under each emotion is obtained, the contribution degree u_ij of feature parameter θ_i under each emotion is computed from the dispersion.
6) When fuzzy K-nearest-neighbour is used to classify a sample, the contribution degrees of the emotional feature parameters are used to weight the Euclidean distance.
Finally, the features that contribute most to emotion recognition are retained (a dispersion/contribution sketch follows this step).
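One plausible reading of steps 3)-6) is sketched below: per-emotion means, normalization by the mean, a dispersion measure, and a contribution weight that grows as the dispersion shrinks. The exact normalization, dispersion and contribution formulas appear only as images in the original, so the concrete formulas used here (standard deviation of normalized values, inverse-dispersion weighting) are assumptions.

```python
import numpy as np

def contribution_weights(features, labels, n_classes):
    """features: (num_samples, N) emotion feature parameters; labels: (num_samples,).

    Returns u of shape (n_classes, N): contribution of feature j under emotion i.
    The inverse-dispersion weighting is an illustrative choice.
    """
    u = np.zeros((n_classes, features.shape[1]))
    for i in range(n_classes):
        F = features[labels == i]                 # samples of emotion i
        M = F.mean(axis=0)                        # per-feature mean under emotion i
        normalized = F / (M + 1e-12)              # normalize each sample by the mean
        dispersion = normalized.std(axis=0)       # spread of the normalized values
        w = 1.0 / (dispersion + 1e-12)            # smaller dispersion -> larger weight
        u[i] = w / w.sum()                        # contributions sum to 1 per emotion
    return u

# The weights can then scale the Euclidean distance used by the fuzzy KNN,
# e.g. d(x, y) = sqrt(sum_j u_ij * (x_j - y_j) ** 2).
```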
(5) Emotion recognition is carried out with the optimized SVM-KNN method:
1) A speech emotion recognition model is constructed according to a multi-level classification strategy.
2) The emotional confusion degree is the degree of similarity between two different emotions.
The confusion degree Iij between the i-th emotion Bi and the j-th emotion Bj is defined as the average of the probability of misjudging the i-th emotion as the j-th emotion and the probability of misjudging the j-th emotion as the i-th emotion; in its mathematical expression, x is the test data and t is the recognition result corresponding to the test data x.
3) The construction algorithm of the multilevel classification comprises the following specific steps:
a. A speech emotion recognition confusion matrix is computed with the traditional support vector machine (SVM) method;
b. A first-level classifier is constructed with a probability threshold P1; emotions whose confusion degree exceeds P1 are merged into one class, i.e. if Iab > P1 and Icd > P1, then a and b form one group and c and d form another; if Iab > P1 and Ibc > P1, then a, b and c are grouped together.
Once the higher-level classifier has been constructed, a new threshold P2 is set when constructing the second-level classifier; if Iab > P2 and Ibc > P2, then a, b and c are likewise grouped together. In this work the first-level threshold P1 is set to 10%, and the threshold of each subsequent level is incremented by 2% over that of the level above, i.e. 10%, 12%, 14%, 16%, and so on;
c. calculating the emotional confusion degree of the non-grouped emotional states according to the formula (1.1), turning to the step b, and classifying the non-grouped emotional states into the existing group or the independent group;
d. The procedure ends when all four emotions have been correctly grouped (a grouping sketch follows this list).
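A sketch of the grouping rule in steps a-d: the confusion degree I_ij is read off a confusion matrix as the average of the two misclassification probabilities, and emotions whose confusion exceeds the current level's threshold (10% at the first level, then +2% per level) are merged into one group. The union-find style merging below is an implementation convenience, not something prescribed by the patent.

```python
import numpy as np

def confusion_degree(conf, i, j):
    """I_ij: average of P(i misjudged as j) and P(j misjudged as i).

    conf is a row-normalized confusion matrix, conf[i, j] = P(predict j | true i).
    """
    return 0.5 * (conf[i, j] + conf[j, i])

def group_emotions(conf, level=0, p1=0.10, step=0.02):
    """Group emotions whose mutual confusion exceeds the level's threshold."""
    threshold = p1 + level * step                 # 10%, 12%, 14%, ... per level
    n = conf.shape[0]
    parent = list(range(n))

    def find(a):                                  # simple union-find lookup
        while parent[a] != a:
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if confusion_degree(conf, i, j) > threshold:
                parent[find(j)] = find(i)         # merge i and j into one group

    groups = {}
    for e in range(n):
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())
```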
Examples
Step 1: the method for preprocessing the raw data comprises the following steps:
(1) In this embodiment the EMO-DB data set is used; it is a German emotional speech corpus recorded at the Technical University of Berlin, obtained by having 10 actors (5 men and 5 women) act out 10 sentences (5 long and 5 short) in 7 emotions (neutral, anger, fear, joy, sadness, disgust and boredom), giving about 800 utterances in total, sampled at 48 kHz (later downsampled to 16 kHz) with 16-bit quantization. The corpus texts were chosen to be semantically neutral, without emotional tendency, in an everyday spoken style free of excessive written-language phrasing. The recordings were made in a professional studio, and before performing a given emotion the actors were asked to induce it by recalling a real experience of their own, to make the emotion more genuine. A listening test with 20 participants (10 men and 10 women) yielded a recognition rate of 84.3%.
After the listening test, 233 male emotional sentences and 302 female emotional sentences were retained, 535 sentences in total. The sentence content comprises 5 short and 5 long everyday phrases, allows a high degree of emotional freedom and carries no specific emotional tendency. The files are sampled at 16 kHz, quantized with 16 bits, and saved in WAV format.
(2) Preprocessing: the pre-emphasis coefficient is set and the speech signal is pre-emphasized; the windowed-framing frame length is set and the speech signal is framed.
Step 2: Specific channels of the preprocessed data are selected, and the data of the different microphone channels are delay-aligned using a microphone-array solution to localize the sound source and improve the audio quality.
(3) Features are extracted with the BN-DNN based on the SHL structure: the number of layers and the parameters of each layer are set, the BN-DNN neural network is constructed and trained, the MFCC is fed in, and the speech bottleneck features are output.
(4) Fuzzy feature selection is applied to the extracted features: the class membership of each extracted feature is computed according to the formula, and the sample x is assigned to the i-th class, until all samples have been processed.
(5) Emotion recognition is performed with the optimized SVM-KNN method: the emotional confusion degrees are computed and the emotions are assigned to the corresponding groups.
(6) The accuracy is computed to obtain the final result (an SVM-KNN decision sketch follows this section).
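The core SVM-KNN decision of step (5) can be sketched with scikit-learn: samples far from the hyperplane (|g(x)| > e) take the SVM decision, while samples inside the margin band fall back to a nearest-neighbour vote over the support vectors. The threshold value and the use of scikit-learn are illustrative choices, not part of the patent.

```python
import numpy as np
from sklearn.svm import SVC

def svm_knn_predict(svm: SVC, x, e=0.5):
    """Hybrid decision for one sample x (binary case: emotion A vs. non-A).

    svm is a fitted sklearn SVC; e is the margin threshold (assumed value).
    """
    g = svm.decision_function(x.reshape(1, -1)).item()  # g(x) = sum_i y_i a_i K(x_i, x) + b
    if abs(g) > e:
        return int(np.sign(g))                           # reliable: use the SVM sign
    # Near the hyperplane: 1-NN over the support vectors (the set T).
    sv = svm.support_vectors_
    sv_is_positive = svm.dual_coef_[0] > 0               # sign of y_i * a_i identifies the class
    nearest = np.argmin(np.linalg.norm(sv - x, axis=1))
    return 1 if sv_is_positive[nearest] else -1
```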
Claims (1)
1. A speech emotion recognition method based on SVM-KNN is characterized by comprising the following steps:
(1) preprocessing original data by pre-emphasis, framing, windowing and endpoint detection;
1) the high-frequency part is improved by utilizing a pre-emphasis technology, so that the frequency spectrum of the signal becomes flat, and the frequency spectrum analysis or the vocal tract parameter analysis is facilitated;
2) performing framing processing on the speech signal; to make the transition between frames smooth and keep continuity, an overlapping segmentation method is used, a new segment being taken each time the frame is shifted, so that as many frames as possible are obtained for short-time analysis;
3) multiplying s(n) by a window function w(n) to form the windowed speech signal sw(n) = s(n)·w(n);
4) accurately finding out a starting point and an ending point of a voice signal from a section of voice signal so as to separate an effective voice signal from a useless noise signal;
(2) designing a corresponding microphone array, and performing delay alignment on data of different microphone channels by using a microphone array solution to realize sound source positioning and improve audio quality;
1) estimating a noise power spectrum of an input speech signal by using a first-order recursive smoothing method;
2) calculating the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the voice signal with noise;
3) smoothing the noisy speech signal to obtain the smoothed power spectrum S(λ, k) of the signal, and performing a minimum-value search on the smoothed output signal to obtain Smin(λ, k);
4) solving the speech-presence probability I(λ, k), performing a second smoothing and minimum-value search accordingly, and computing the speech-presence probability q(λ, k);
5) according to the formula
6) Updating time-varying smoothing parameters and a smoothing noise power spectrum;
(3) extracting the characteristics of the processed data based on BN-DNN of an SHL structure;
1) firstly, extracting 39-dimensional MFCC features (13 static coefficients plus first- and second-order deltas) from 1 hour of Vystdial_cz data, training a triphone GMM model, and performing forced alignment;
2) training a triphone GMM acoustic model based on linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT) (13-dimensional MFCC features spliced over 9 frames and reduced to 40 dimensions by LDA), the model having 19200 Gaussian mixture components;
3) then, carrying out Speaker Adaptive Training (SAT) by using a Feature-space maximum likelihood linear regression (fMLLR) technology, thereby forming a GMM acoustic model of LDA + MLLT + SAT;
4) obtaining the training targets of the softmax layer of the BN-DNN through forced alignment with this model; fbank features, which work well, are used as the DNN training features: 40-dimensional fbank features are extracted and spliced over 11 frames (5-1-5), and the resulting supervector is used as the DNN input;
5) performing 10 rounds of RBM pre-training on each hidden layer (including the BN layer), then fine-tuning the global parameters with the BP algorithm, and finally extracting the three major classes of features: prosody, voice quality and spectrum;
(4) selecting the extracted features based on a fuzzy set theory method;
1) analyzing the short-time energy, short-time amplitude and short-time zero-crossing rate of the extracted features with the function TimePara(), and extracting the pitch frequency with the function FunFre();
2) after short-time energy, short-time amplitude, short-time zero-crossing rate and fundamental tone frequency are respectively extracted, the extracted characteristic parameters are combined into a characteristic vector to be used as the input of a fuzzy set.
3) for C-class emotion recognition, counting over the training sample set X the mean of the same feature parameter under the C different emotion states and recording it as Mij (i = 1, 2, …, C; j = 1, 2, …, N, N being the number of emotion feature parameters), and then normalizing every feature parameter of each speech sample under each emotion state (with n indexing the samples of that emotion state, n = 1 being the first sentence, and so on) according to the normalization formula;
4) Then calculating the dispersion of the characteristic parameter under a certain specific emotion:
5) after the dispersion of each feature parameter under each emotion is obtained, calculating from the dispersion the contribution degree u_ij of feature parameter θ_i under each emotion:
Weighting the contribution degree of the emotional characteristic parameters and the Euclidean distance when the sample to be classified is judged by using fuzzy K nearest neighbor;
finally, extracting features which have the largest contribution to emotion recognition;
(5) performing emotion recognition on the voice features by adopting an optimized SVM-KNN method based on the extracted features;
1) decomposing the 6-emotion classification problem and establishing a multi-level SVM classifier based on a decision tree: the SVM at each level recognizes one emotion from the sample set, the remaining samples enter the SVM of the next level for recognition, decreasing level by level as shown in FIG. 1, until the leaf nodes of the decision tree yield the recognized emotions;
2) for the misclassified samples that arise near the SVM hyperplane, the KNN algorithm is combined with the SVM to construct an SVM-KNN joint classification model and improve the SVM's accuracy; the SVM-KNN classification steps are:
① initially, all samples in the training set are regarded as labeled; a small number of samples are randomly selected from the training set to build a small training set, ensuring that every emotion appears at least once in the initial training sample set;
② a weak SVM classifier for emotion A is obtained from the initial training samples; its optimal classification hyperplane, support vector set T, classification decision-function coefficients W and constant b are determined;
③ a sample is taken from emotion class A and its similarity to all non-A samples is computed; the n samples most likely not to belong to class A are selected and recorded as sample set A; a sample is then taken from the non-A samples and its similarity to all class-A samples is computed; the n samples most likely to belong to class A are selected and recorded as sample set B;
④ the samples in A and B are points near the hyperplane; each sample x in A and B is substituted into the decision function
g(x) = ∑i yi·ai·K(xi, x) + b (6);
⑤ if |g(x)| > e, the SVM classifies the sample point with high accuracy and reliability, so its class is determined by the decision function f(x) = sgn(g(x));
⑥ if |g(x)| < e, the sample point lies near the hyperplane, its classification reliability is low and it is easily misclassified; therefore the KNN method is used to determine the class of the sample x: the support vector sets T of class A and non-A are taken as training samples, the distance d(x, xi) between x and each vector in T is computed, and the class of the vector closest to x is taken as the class of x;
in the formula, xi is a support vector and K(·) is a first-order polynomial kernel; the threshold e ranges over [0, 1] and can be adjusted dynamically according to the experimental results; it is generally initialized to 1, and if it is set to 0 the algorithm reduces to the traditional SVM algorithm;
⑦ the samples classified by the SVM and those classified by KNN are added to the initial training set to expand it, and a new SVM2 is trained on the expanded training set;
⑧ this is iterated until all samples of the training set have been added to the initial training set, at which point iteration stops; the final training set yields an SVM classifier with high classification accuracy for class-A emotion;
⑨ at this point the first-level SVM classifier of the decision tree has been trained; the non-A sample set is then used to train the next-level SVM, and training proceeds level by level to obtain the SVM classifier corresponding to each emotion class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111127502.7A CN113870901B (en) | 2021-09-26 | 2021-09-26 | SVM-KNN-based voice emotion recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113870901A true CN113870901A (en) | 2021-12-31 |
CN113870901B CN113870901B (en) | 2024-05-24 |
Family
ID=78994361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111127502.7A Active CN113870901B (en) | 2021-09-26 | 2021-09-26 | SVM-KNN-based voice emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113870901B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107492384A (en) * | 2017-07-14 | 2017-12-19 | 北京联合大学 | A kind of speech-emotion recognition method based on fuzzy nearest neighbor algorithm |
KR20190102667A (en) * | 2018-02-27 | 2019-09-04 | 광주과학기술원 | Emotion recognition system and method thereof |
CN108899046A (en) * | 2018-07-12 | 2018-11-27 | 东北大学 | A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification |
CN109036468A (en) * | 2018-11-06 | 2018-12-18 | 渤海大学 | Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core |
CN111832438A (en) * | 2020-06-27 | 2020-10-27 | 西安电子科技大学 | Electroencephalogram signal channel selection method and system for emotion recognition and application |
Non-Patent Citations (1)
Title |
---|
WANG Guangyan; ZHANG Peiwen; YU Baoyun: "Intelligent recognition of Chinese speech emotion information based on an SVM multi-classification algorithm", Electronic Components and Information Technology, No. 07, 20 July 2020 (2020-07-20) *
Also Published As
Publication number | Publication date |
---|---|
CN113870901B (en) | 2024-05-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |