CN113870901B - SVM-KNN-based voice emotion recognition method - Google Patents

SVM-KNN-based voice emotion recognition method

Info

Publication number
CN113870901B
CN113870901B
Authority
CN
China
Prior art keywords
emotion
sample
svm
classification
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111127502.7A
Other languages
Chinese (zh)
Other versions
CN113870901A (en)
Inventor
王海
路璐
侯宇婷
冯毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202111127502.7A priority Critical patent/CN113870901B/en
Publication of CN113870901A publication Critical patent/CN113870901A/en
Application granted granted Critical
Publication of CN113870901B publication Critical patent/CN113870901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises the following steps: first, preprocessing the original voice signal; second, performing voice enhancement with a microphone-array delay-alignment method; third, extracting features from the processed data with a BN-DNN based on the SHL structure; fourth, selecting the extracted features with a fuzzy-set-theory-based method; and fifth, performing emotion recognition with an optimized SVM-KNN method. With the method, a user can obtain higher voice emotion classification accuracy, the problem of limited optimization under large-scale training samples is avoided, and the SVM classification accuracy and recognition speed are improved. The SVM-KNN concept proposed by the invention can also be applied to other fields of voice recognition, such as dialect classification, and provides a reference for classification and recognition based on voice signals.

Description

SVM-KNN-based voice emotion recognition method
Technical Field
The invention relates to voice emotion recognition, in particular to a voice emotion recognition method based on SVM-KNN.
Background
Among current speech emotion recognition methods, the support vector machine (SVM) has proved to be a relatively effective classification tool, but when the degree of emotion confusion is high, accurate recognition with the SVM alone is still difficult.
For a long time, the study of emotion was carried out mainly by experts in physiology and psychology. With the rapid development of artificial intelligence, emotion research in human-computer interaction has attracted great interest from a large number of experts. In human-computer interaction, people hope to communicate with machines more naturally, which requires the machine to understand human emotion, so the classification and recognition of emotion are particularly important. In human communication, language carries rich information, so machines can use speech to classify and recognize emotion. Experts have carried out a great deal of research and analysis on speech emotion classification and recognition, including building speech emotion databases, extracting emotion features, and classification and recognition methods. To improve the recognition rate of speech emotion, previous work has improved each of these links, but there is no unified system and the recognition rate is not very high. MFCCs have been used as recognition features, but they were not further processed before recognition, so large amounts of redundant information affected the recognition. To eliminate this influence and improve the recognition rate, selecting an appropriate classifier is an important focus of study; to improve the emotion recognition rate and process emotion features accurately, it is important to select a proper classification method.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a voice emotion recognition method based on SVM-KNN, in which voice enhancement is performed with a microphone-array delay-alignment method, features are extracted with a BN-DNN based on the SHL structure, features are selected with a fuzzy-set-theory-based method, and emotion recognition is then performed with an optimized SVM-KNN method, finally forming a voice emotion recognition method with high precision and low computational load.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a voice emotion recognition method based on SVM-KNN comprises the following steps of different voice signal preprocessing modes, specific feature extraction, fuzzy feature selection and SVM-KNN support vector machine classification:
(1) Preprocessing the input voice signal; the preprocessing comprises pre-emphasis filtering and windowed framing, where the pre-emphasis coefficient α of the pre-emphasis filter is 0.95 and the frame length of the windowed framing is 26 ms (a minimal sketch of this step is given below);
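A minimal preprocessing sketch in Python; the coefficient α = 0.95 and the 26 ms frame length come from the text, while the 16 kHz sampling rate, the 10 ms frame shift and the Hamming window are assumptions:

import numpy as np

def preprocess(signal, fs=16000, alpha=0.95, frame_ms=26, shift_ms=10):
    """Pre-emphasis, framing and Hamming windowing (shift and window type assumed)."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)     # 26 ms -> 416 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)   # assumed 10 ms shift
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames   # shape: (n_frames, frame_len)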
(2) The data of different microphone channels are delay-aligned using a microphone array solution to localize the sound source and improve audio quality (a delay-alignment sketch is given after this step):
1) A nested microphone array structure consisting of 9 microphones is used; it is in effect 4 nested linear sub-arrays, each consisting of 5 equidistant microphones (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm, respectively), thus ensuring that the frequency range of the recorded speech signal covers 300-3400 Hz.
2) Meanwhile, the proportional relation between the microphone spacing and the distance from the speaker to the array is taken into account so that the far-field sound-field assumption is met, and the impulse response of the hypothetical room is modeled with the image model.
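The text does not reproduce the delay-alignment procedure itself; the following is a hedged sketch of one common approach, estimating each channel's delay against a reference microphone by cross-correlation and shifting the channels accordingly (function names are illustrative):

import numpy as np

def align_channels(channels, ref_idx=0):
    """Delay-align multi-microphone recordings to a reference channel by cross-correlation."""
    ref = channels[ref_idx]
    aligned = []
    for ch in channels:
        # Lag that maximizes the cross-correlation with the reference channel.
        corr = np.correlate(ch, ref, mode="full")
        lag = np.argmax(corr) - (len(ref) - 1)
        # Shift the channel by -lag so it lines up with the reference.
        aligned.append(np.roll(ch, -lag))
    return np.stack(aligned)

# After alignment the channels can be averaged (delay-and-sum) to enhance the speech.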
(3) Adopting BN-DNN based on SHL structure to extract the characteristics, wherein the characteristic extraction process is as follows:
1) In the experiment, 5 hidden layers are set in the BN-DNN model, the 3rd hidden layer is set as the bottleneck layer, and the number of neurons in the other hidden layers is 1024; the input data are the 40-dimensional MFCC features of 11 consecutive frames.
2) The number of neurons in the input layer is therefore set to 440 (40 × 11). The DNN network structure is 440-[1024-1024-1024-1024-1024]-440.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental group sizes were 64, 128 and 256, and the overlap coefficient α was set to 0%, 20%, 30% and 40%.
4) The sparsity of the network is measured by using the proportion of activation probability h equal to 0 in the neurons, and the sparsity is defined as:
where D represents the number of neurons in a layer and h_i (i = 1, 2, …, D) represents the activation probability of the i-th neuron; a greater sparsity means the neurons in the hidden layer are sparser, i.e., more neurons have an activation probability of 0. For each model, the training set is used to train the model and obtain the activation probability of the neurons in each layer, the sparsity of each layer is then computed on the training set, and finally the average sparsity over all hidden layers is taken as the sparsity of the whole neural network, after which the voice bottleneck features are extracted.
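A small sketch of the sparsity measure as described above (fraction of neurons whose activation probability is 0, averaged over all hidden layers):

import numpy as np

def layer_sparsity(activations):
    """Sparsity of one hidden layer: proportion of neurons whose activation probability is 0."""
    h = np.asarray(activations)   # shape: (D,) activation probabilities h_1 ... h_D
    return float(np.mean(h == 0))

def network_sparsity(hidden_layer_activations):
    """Average sparsity over all hidden layers of the BN-DNN."""
    return float(np.mean([layer_sparsity(h) for h in hidden_layer_activations]))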
(4) And selecting the characteristics by adopting a method based on fuzzy set theory:
1) In the feature space R^n, for a c-class problem, the training sample set is X = {x1, x2, …, xN}, where N is the number of samples; for the sample x to be classified, the value K of the K nearest neighbors is determined first.
2) Determining the distances between the sample to be classified and all training samples, where the Euclidean distance is adopted:
3) The N distances are sorted in ascending order:
d(1) ≤ d(2) ≤ d(3) ≤ … ≤ d(K) ≤ d(K+1) ≤ … ≤ d(N)
where d(1), …, d(K) are the distances from the sample to be classified to its K nearest neighbors.
4) Calculating the class membership of the sample to be classified according to formula (1), where m is the fuzzy weight adjustment factor; if u_i(x) = max_n{u_n(x)}, then x is judged to belong to class i (a sketch of this step is given below).
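Since formula (1) is not reproduced here, the following sketch uses the classical fuzzy-KNN membership weighting as an assumption; k and m correspond to K and the fuzzy weight adjustment factor in the text:

import numpy as np

def fuzzy_knn_membership(x, X_train, y_train, n_classes, k=5, m=2.0):
    """Fuzzy-KNN class membership for one test sample; y_train holds integer labels 0..n_classes-1."""
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances to all training samples
    nn = np.argsort(d)[:k]                           # indices of the K nearest neighbors
    w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + 1e-12)   # fuzzy distance weights (assumed form)
    u = np.zeros(n_classes)
    for idx, wi in zip(nn, w):
        u[y_train[idx]] += wi                        # accumulate weight for the neighbor's class
    u /= u.sum()
    return u                                         # x is assigned to class argmax(u), as in the text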
(5) And carrying out emotion recognition by adopting an optimized SVM-KNN method:
1) Assuming that each sample has a class membership degree s_i, the fuzzified input sample set is S = {(x1, y1, s1), (x2, y2, s2), …, (xl, yl, sl)}, where x_i ∈ R^n, y_i ∈ {1, -1}, and σ ≤ s_i ≤ 1 with σ a sufficiently small positive number; s_i indicates the degree to which the i-th sample belongs to the positive class.
2) In the nonlinear case a transformation φ: R^n → F is introduced to map samples from the input space to a high-dimensional feature space F; an optimal classification hyperplane is determined in the feature space using the structural-risk-minimization principle and the idea of maximizing the classification margin, and solving the FSVM optimal-hyperplane problem can be converted into the following optimization problem:
ξ_i ≥ 0, i = 1, …, l.
3) Building the Lagrange function:
where μ_i > 0 and β_i > 0 are the Lagrange multipliers, C0 > 0 is the penalty factor, and w is the weight coefficient of the linear classification function y.
4) The following dual programming problem is obtained.
0 ≤ μ_i ≤ s_i C0, i = 1, …, l. (5)
where k(x_i, x_j) is the kernel function. Taking the KKT conditions into account, a sample x_i with μ_i = 0 is a correctly classified sample, i.e., a non-support vector; a sample with 0 < μ_i < s_i C0 is a support vector on the margin boundary, i.e., the sample x_i lies in the correctly partitioned region on the margin boundary (a sample-weighted sketch of this formulation is given below).
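The dual problem above gives each sample the individual box constraint 0 ≤ μ_i ≤ s_i C0. As an approximation, scikit-learn's per-sample sample_weight (which rescales C for each sample) can emulate this fuzzy weighting; this is a sketch under that assumption, not the patent's exact solver:

import numpy as np
from sklearn.svm import SVC

def train_fsvm(X, y, s, C0=1.0, kernel="rbf"):
    """Fuzzy-weighted SVM sketch: s holds the memberships s_i in (0, 1], scaling each sample's penalty."""
    clf = SVC(C=C0, kernel=kernel)
    clf.fit(X, y, sample_weight=np.asarray(s))   # per-sample weight approximates s_i * C0
    return clf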
Drawings
FIG. 1 is a flow chart of speech signal enhancement according to the present invention;
FIG. 2 is a flow chart of a multi-level SVM classifier of the present invention;
FIG. 3 shows the SVM-KNN classification step according to the invention.
FIG. 4 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1, 2 and 3, the speech emotion recognition method based on SVM-KNN comprises the following steps: speech signal preprocessing, feature extraction, fuzzy-KNN feature selection, and SVM-KNN (support vector machine plus K-nearest-neighbor) classification:
(1) Preprocessing the original data; the original data are preprocessed with pre-emphasis, framing, windowing and endpoint detection.
1) The signal s(n) becomes sw(n) after windowing, according to the formula: sw(n) = s(n) × w(n)
2) The data are filtered with a pre-emphasis filter whose pre-emphasis coefficient α is 0.95.
(2) A corresponding microphone array is designed, and the data of different microphone channels are delay-aligned using the microphone array solution to realize sound-source localization and improve audio quality.
1) A nested microphone array structure consisting of 9 microphones, in effect 4 nested linear sub-arrays each consisting of 5 equidistant microphones (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm), ensures that the frequency range of the recorded speech signal covers 300-3400 Hz.
2) The image model is used, taking into account the proportional relationship between the microphone spacing and the distance from the speaker to the array, so as to conform to the impulse response of a hypothetical room whose sound field satisfies the far-field condition.
(3) And performing feature extraction on the processed data based on BN-DNN of the SHL structure.
1) 5 hidden layers are set in the BN-DNN model, the 3rd hidden layer is set as the bottleneck layer, and the number of neurons in the other hidden layers is 1024; the 40-dimensional MFCC data of 11 consecutive frames are taken as the input.
2) The DNN network structure is 440- [1024-1024-1024-1024-1024] -440.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental group sizes were 64, 128 and 256, and the overlap coefficient α was set to 0%, 20%, 30% and 40%.
4) The larger the proportion of neurons whose activation probability equals 0, the sparser the network; the sparsity is defined as:
Finally, the average sparsity over all hidden layers is computed as the sparsity of the whole neural network, and the voice bottleneck features are extracted.
(4) And selecting the extracted features based on a fuzzy set theory method:
1) The energy, short-time amplitude, short-time zero-crossing rate and pitch frequency characteristics are extracted using a function.
2) And forming the extracted characteristic parameters into characteristic vectors, and taking the characteristic vectors as the input of the fuzzy set.
3) For C-class emotion recognition, the average value of each characteristic parameter over the training sample set X under each of the C emotion states is computed and denoted M_ij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); each feature parameter M_jm of each speech sample in each emotion state (m indexes the samples within that emotion state, m = 1 being the first sentence, and so on) is normalized by the normalization formula shown below.
4) The dispersion of each characteristic parameter under a specific emotion is then calculated:
5) After the dispersion of each characteristic parameter under each emotion is obtained, the contribution degree of each characteristic parameter under each emotion is calculated according to the dispersion degree:
6) The contribution degrees of the emotion characteristic parameters are used, together with the Euclidean distance, as weights when the fuzzy K-nearest-neighbor method discriminates the sample to be classified.
Finally, the features that contribute most to emotion recognition are extracted (a sketch of the dispersion and contribution computation is given below).
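The dispersion and contribution formulas are not reproduced in the text; the sketch below assumes that the dispersion is the standard deviation of the class-mean-normalized parameter and that the contribution is inversely proportional to it:

import numpy as np

def feature_contributions(X, y, n_classes):
    """Per-emotion feature dispersion and contribution (assumed forms, see lead-in)."""
    n_feats = X.shape[1]
    contrib = np.zeros((n_classes, n_feats))
    for c in range(n_classes):
        Xc = X[y == c]                            # samples of emotion c
        mean = Xc.mean(axis=0)                    # M_ij in the text
        normalized = Xc / (mean + 1e-12)          # normalize each parameter by its class mean
        dispersion = normalized.std(axis=0)       # dispersion of feature j under emotion c
        contrib[c] = 1.0 / (dispersion + 1e-12)   # smaller dispersion -> larger contribution
        contrib[c] /= contrib[c].sum()            # normalize contributions to sum to 1
    return contrib   # later used to weight the Euclidean distance in the fuzzy KNN

# Usage: w = feature_contributions(X_train, y_train, n_classes=7)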
(5) And carrying out emotion recognition by adopting an optimized SVM-KNN method:
1) And constructing a voice emotion recognition model according to the multi-level classification strategy.
2) The emotion confusion degree is the similarity degree of two different emotions.
For the i-th emotion B_i and the j-th emotion B_j, the confusion degree is I_ij, defined as the average of the probability that the i-th emotion is misjudged as the j-th emotion and the probability that the j-th emotion is misjudged as the i-th emotion; the mathematical expression is as follows:
where x is the test data and t is the recognition result corresponding to the test data x.
3) The construction algorithm of the multi-level classification comprises the following specific steps:
a. Calculating a voice emotion recognition confusion matrix by using a traditional support vector machine (SVM) method;
b. Constructing a first-level classifier and setting its probability threshold P1; emotions whose confusion degree exceeds P1 are classified into one group, i.e., if I_ab > P1 and I_cd > P1, then a and b are grouped together and c and d are grouped together; if I_ab > P1 and I_bc > P1, then a, b and c are grouped together.
After the upper-level classifier has been constructed, the probability P2 of the second-level classifier is set; if I_ab > P2 and I_bc > P2, then a, b and c are likewise classified as one group. Here, P1 is set to 10% when the first-level classifier is designed, and each subsequent classifier's probability is increased by 2% over that of the level above it, i.e., the second-level P2 is based on the first-level P1 plus 2%, so the thresholds are 10%, 12%, 14%, 16% and so on;
c. calculating the emotion confusion degree of the ungrouped emotion states according to the formula (1.1), and turning to the step b to classify the ungrouped emotion states into the existing groups or the independent groups;
d. when all four emotions have been correctly grouped, the procedure ends (a confusion-degree and grouping sketch is given below).
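A sketch of the confusion-degree computation and threshold-based grouping described above; the confusion matrix itself would come from the first-pass SVM, and the function and variable names are illustrative:

import numpy as np

def confusion_degree(conf_matrix):
    """I_ij = average of P(i misjudged as j) and P(j misjudged as i), from a row-normalized confusion matrix."""
    P = conf_matrix / conf_matrix.sum(axis=1, keepdims=True)
    return (P + P.T) / 2.0

def group_emotions(I, threshold):
    """Group emotions whose pairwise confusion degree exceeds the level's threshold (P1 = 10%, +2% per level)."""
    n = I.shape[0]
    groups = [{i} for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if I[i, j] > threshold:
                gi = next(g for g in groups if i in g)
                gj = next(g for g in groups if j in g)
                if gi is not gj:
                    gi |= gj             # merge the two groups
                    groups.remove(gj)
    return groups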
Examples
Step 1: preprocessing the original data, comprising the following steps:
(1) In this example the EMO-DB dataset was used: a German emotional speech library recorded at Berlin University, in which 10 actors (5 male and 5 female) simulate 7 emotions (neutral, anger, fear, joy, sadness, disgust, boredom) over 10 sentences (5 long and 5 short), containing a corpus of 800 sentences in total, sampled at 48 kHz (later downsampled to 16 kHz) with 16-bit quantization. The corpus text was selected to be semantically neutral and free of emotional tendency, in an everyday spoken style without excessive written polishing. Recording was done in a professional studio, and the actors were asked to enhance the realism of the emotion by recalling their own real experiences for emotional incubation before performing a given emotion. A listening test with 20 participants (10 men and 10 women) yielded a recognition rate of 84.3%.
After the listening test, 233 male emotion sentences and 302 female emotion sentences were retained, 535 sentences in total. The sentence content comprises 5 short and 5 long sentences of everyday expression, allows a high degree of emotional freedom, and contains no specific emotional tendency. The files are sampled at 16 kHz with 16-bit quantization and saved in WAV format.
(2) Preprocessing: the pre-emphasis coefficient is set and pre-emphasis is applied to the voice signal, and the windowed-framing frame length is set to frame the voice signal.
Step 2: and selecting a specific channel from the preprocessed data, and delaying and aligning the data of different microphone channels by using a microphone array solution to realize the positioning of a sound source and improve the audio quality.
(3) And performing feature extraction by adopting BN-DNN based on an SHL structure. Setting the layer number and each layer parameter, and constructing the BN-DNN neural network. Training a neural network, inputting the MFCC, and finally outputting the voice bottleneck characteristics.
(4) The extracted features are selected as fuzzy features: the class membership of each sample is calculated from the extracted features and the sample x is judged to belong to the i-th class, until all samples have been processed.
(5) Emotion recognition is carried out with the optimized SVM-KNN method: the emotion confusion degrees are calculated and the emotions are classified into the corresponding groups.
(6) And counting the accuracy rate to obtain a final result.

Claims (1)

1. A voice emotion recognition method based on SVM-KNN is characterized by comprising the following steps:
(1) Preprocessing the original data; preprocessing the original data by using a pre-emphasis, framing, windowing and endpoint detection method;
1) The pre-emphasis technology is utilized to boost the high-frequency part, so that the frequency spectrum of the signal is flattened, which facilitates spectrum analysis or vocal-tract parameter analysis;
2) Framing the voice signal; in order to make the transition between frames smooth and keep continuity, overlapping segmentation is used, a segment being intercepted at each frame shift, so that as many frames as possible are obtained to facilitate short-time analysis;
3) Multiplying s(n) by a window function w(n), thereby forming the windowed speech signal sw(n) = s(n) × w(n);
4) Accurately finding the start point and end point of the speech within a segment of the voice signal, so that the effective speech signal and the useless noise signal are separated;
(2) Designing a corresponding microphone array, and carrying out time delay alignment on data of different microphone channels by utilizing a microphone array solution to realize sound source positioning and improve audio quality;
1) Estimating a noise power spectrum of the input speech signal using a first order recursive smoothing method;
2) Calculating the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the noisy voice signal;
3) Smoothing the noisy speech signal to obtain the smoothed power spectrum S(λ, k) of the signal, and searching for the minimum of the smoothed output to obtain S_min(λ, k);
4) Solving for the speech-presence indicator I(λ, k), carrying out a second round of smoothing and minimum searching according to it, and calculating the speech-absence probability q(λ, k);
5) Calculating the speech-presence probability according to the following formula:
6) Updating the time-varying smoothing parameters and the smoothed noise power spectrum (a sketch of this noise-tracking step is given below);
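A hedged sketch of one frame of this noise-tracking step; the recursion constants and the decision-directed form of the a priori SNR are assumptions, since the claim does not give the exact formulas:

import numpy as np

def update_noise_and_snr(Y_pow, noise_pow_prev, S_prev, gain_prev,
                         alpha_n=0.9, alpha_s=0.8, alpha_dd=0.98):
    """Per-frame noise PSD smoothing, a posteriori SNR and decision-directed a priori SNR (sketch)."""
    noise_pow = alpha_n * noise_pow_prev + (1 - alpha_n) * Y_pow   # first-order recursive smoothing
    snr_post = Y_pow / (noise_pow + 1e-12)                          # a posteriori SNR
    snr_prior = alpha_dd * (gain_prev ** 2) * snr_post + \
                (1 - alpha_dd) * np.maximum(snr_post - 1.0, 0.0)    # a priori SNR (assumed form)
    S = alpha_s * S_prev + (1 - alpha_s) * Y_pow                    # smoothed power spectrum S(lambda, k)
    return noise_pow, snr_post, snr_prior, S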
(3) Performing feature extraction on the processed data based on BN-DNN of the SHL structure;
1) Firstly, 39-dimensional MFCC features (13 + Δ + ΔΔ) are extracted from 1 hour of VYSTADIAL_CZ data, a triphone GMM model is trained, and forced alignment is carried out;
2) A triphone GMM acoustic model based on Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) is trained (9-frame splicing of 13-dimensional MFCC features, reduced to 40 dimensions by LDA), where the number of Gaussian mixture components of the model is 19200;
3) Then, Speaker Adaptive Training (SAT) is performed using the Feature-space Maximum Likelihood Linear Regression (fMLLR) technique, forming an LDA+MLLT+SAT GMM acoustic model;
4) The training targets of the softmax layer in the BN-DNN are obtained by forced alignment with this model; fbank features work well as DNN training features, so 40-dimensional fbank features are extracted, 11 frames are spliced (5-1-5), and the resulting supervectors are used as the input features of the DNN;
5) 10 rounds of RBM pre-training are performed on each hidden layer (including the BN layer), the global parameters are then fine-tuned with the BP algorithm, and finally the three major classes of features of prosody, voice quality and spectrum are extracted (a frame-splicing sketch is given below);
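A small sketch of the 11-frame (5-1-5) splicing that produces the 440-dimensional DNN input supervectors; edge padding at the utterance boundaries is an assumption:

import numpy as np

def splice_frames(feats, left=5, right=5):
    """Splice each 40-dim fbank frame with 5 left and 5 right context frames -> 11*40 = 440 dims."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t : t + left + 1 + right].reshape(-1) for t in range(T)])

# Example: 40-dim fbank features for 100 frames -> array of shape (100, 440).
# spliced = splice_frames(np.random.randn(100, 40))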
(4) Selecting the extracted features based on a fuzzy set theory method;
1) Analyzing short-time energy, short-time amplitude, short-time zero-crossing rate and pitch frequency of the extracted features by a function TimePara () and extracting the pitch frequency by a function FunFre ();
2) After the short-time energy, the short-time amplitude, the short-time zero-crossing rate and the fundamental tone frequency are respectively extracted, the extracted characteristic parameters form characteristic vectors which are used as the input of the fuzzy set.
3) For C-class emotion recognition, the average value of each characteristic parameter over the training sample set X under each of the C emotion states is computed and denoted M_ij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); each feature parameter M_jm of each speech sample in each emotion state (m indexes the samples within that emotion state, m = 1 being the first sentence, and so on) is normalized by the normalization formula;
4) Then calculating the dispersion of the characteristic parameters under a certain specific emotion:
5) After the dispersion of each characteristic parameter under each emotion has been calculated, the contribution degree of each characteristic parameter under each emotion is calculated from the dispersion; the contribution u_ij of characteristic parameter θ_i is:
The contribution degrees of the emotion characteristic parameters are used, together with the Euclidean distance, as weights when the fuzzy K-nearest-neighbor method discriminates the samples to be classified;
Finally extracting the characteristics with the largest contribution to emotion recognition;
(5) Performing emotion recognition on the voice features by adopting an optimized SVM-KNN method based on the extracted features;
1) Decomposing the 6-emotion classification problem and establishing a multi-level SVM classifier based on a decision tree; the SVM at each level identifies one emotion from the sample set, the remaining sample set is identified by the SVM of the next level, decreasing level by level as shown in FIG. 1, and the leaf nodes of the decision tree finally give the emotion classes;
2) For the misclassified samples generated near the SVM hyperplane, the KNN algorithm is incorporated, and an SVM-KNN combined classification model is constructed to improve the accuracy of the SVM; the SVM-KNN classification proceeds as follows:
① The method comprises the steps that samples in an initial training set are marked, a small number of samples are selected randomly from the training set, a small sample training set is constructed, and each emotion in the initial training sample set is guaranteed to at least contain one sample;
② Obtaining a weak SVM classifier of emotion A according to the initial training sample, and then determining an optimal classification hyperplane, a support vector set T, a coefficient W of a classification decision function and a constant b;
③ Selecting a sample with fuzzy classification and low accuracy near the hyperplane from the A-class emotion, calculating the similarity of the sample with all the samples of the non-A-class emotion, selecting n most probable samples of the non-A-class emotion, and marking the samples as a sample set A; selecting one sample from non-A samples, calculating the similarity between the sample and all samples of A emotion, selecting n samples most likely to be the A emotion, and marking the samples as a sample set B;
④ The samples in A and B are points near the hyperplane; each sample x in A and B is substituted into the decision function
g(x) = Σ_i y_i a_i K(x_i, x) + b (6);
⑤ If |g(x)| > e, the SVM classifies the sample point with high accuracy and reliability, so the class to which the sample point belongs can be determined by the decision function f(x) = sgn(g(x));
⑥ If |g(x)| < e, the sample point is near the hyperplane, its classification reliability is low and it is easily misclassified, so the class of sample x is determined by the KNN method; the support vector set T of class A and non-class A is used as the training samples, the distance d(x, x_i) between x and each vector in T is calculated, and the class of the nearest vector is taken as the class of sample x,
where x_i is a support vector and K() is a first-order polynomial kernel function; the threshold e ranges over [0, 1], its specific value can be adjusted dynamically according to the experimental results, the initial value is generally set to 1, and if it is adjusted to 0 the algorithm degenerates into the traditional SVM algorithm;
⑦ Placing the sample obtained by SVM classification and the sample of KNN classification into an initial training set to expand the sample, so that a new SVM2 is trained on the basis of the expanded training set;
⑧ Iterating until all samples in the training set are added into the initial training set, and stopping iterating; obtaining an SVM classifier with high class A emotion classification precision by utilizing the final training set;
⑨ The non-A sample set is then used, with the trained first-level SVM classifier in the decision tree, to train the next-level SVM; training proceeds level by level to obtain the SVM classifier corresponding to each emotion category (a sketch of the SVM-KNN decision rule is given below).
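A sketch of the SVM-KNN decision rule for one binary (A vs. non-A) level, using scikit-learn; the class name and the choice of training the KNN on the support vector set T follow the description above, while defaults such as k = 5 are assumptions:

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

class SVMKNN:
    """Confident samples (|g(x)| > eps) are labeled by the SVM; ambiguous samples near
    the hyperplane are labeled by a KNN trained on the support vector set T."""

    def __init__(self, eps=1.0, k=5, C=1.0):
        self.eps = eps
        self.k = k
        self.svm = SVC(C=C, kernel="poly", degree=1)   # first-order polynomial kernel
        self.knn = None

    def fit(self, X, y):
        self.svm.fit(X, y)
        sv = self.svm.support_                          # indices of the support vector set T
        self.knn = KNeighborsClassifier(n_neighbors=min(self.k, len(sv)))
        self.knn.fit(X[sv], y[sv])
        return self

    def predict(self, X):
        g = self.svm.decision_function(X)               # signed score g(x) for the binary problem
        labels = self.svm.predict(X)
        near = np.abs(g) < self.eps                     # samples too close to the hyperplane
        if near.any():
            labels[near] = self.knn.predict(X[near])    # fall back to KNN for ambiguous samples
        return labels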
CN202111127502.7A 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method Active CN113870901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111127502.7A CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111127502.7A CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Publications (2)

Publication Number Publication Date
CN113870901A CN113870901A (en) 2021-12-31
CN113870901B true CN113870901B (en) 2024-05-24

Family

ID=78994361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111127502.7A Active CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN113870901B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492384A (en) * 2017-07-14 2017-12-19 北京联合大学 A kind of speech-emotion recognition method based on fuzzy nearest neighbor algorithm
KR20190102667A (en) * 2018-02-27 2019-09-04 광주과학기술원 Emotion recognition system and method thereof
CN108899046A (en) * 2018-07-12 2018-11-27 东北大学 A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
CN109036468A (en) * 2018-11-06 2018-12-18 渤海大学 Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core
CN111832438A (en) * 2020-06-27 2020-10-27 西安电子科技大学 Electroencephalogram signal channel selection method and system for emotion recognition and application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Intelligent recognition of Chinese speech emotion information based on the SVM multi-classification algorithm; 王光艳; 张培玟; 于宝芸; Electronic Components and Information Technology; 2020-07-20 (07); full text *

Also Published As

Publication number Publication date
CN113870901A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
Chen et al. Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction
Shahin et al. Emotion recognition using hybrid Gaussian mixture model and deep neural network
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN112581979A (en) Speech emotion recognition method based on spectrogram
Ghai et al. Emotion recognition on speech signals using machine learning
Sefara The effects of normalisation methods on speech emotion recognition
Renjith et al. Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers
Ribeiro et al. Binary neural networks for classification of voice commands from throat microphone
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Nawas et al. Speaker recognition using random forest
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
Kaur et al. An efficient speaker recognition using quantum neural network
Gade et al. A comprehensive study on automatic speaker recognition by using deep learning techniques
Konangi et al. Emotion recognition through speech: A review
Moumin et al. Automatic Speaker Recognition using Deep Neural Network Classifiers
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Jiang et al. Research on voiceprint recognition of camouflage voice based on deep belief network
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
CN113870901B (en) SVM-KNN-based voice emotion recognition method
Aggarwal et al. Application of genetically optimized neural networks for hindi speech recognition system
Al-Talabani Automatic speech emotion recognition-feature space dimensionality and classification challenges
Gade et al. Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments
Mangalam et al. Emotion Recognition from Mizo Speech: A Signal Processing Approach
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant