CN113870901A - Voice emotion recognition method based on SVM-KNN - Google Patents

Voice emotion recognition method based on SVM-KNN Download PDF

Info

Publication number
CN113870901A
Authority
CN
China
Prior art keywords
emotion
sample
svm
training
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111127502.7A
Other languages
Chinese (zh)
Other versions
CN113870901B (en)
Inventor
王海
路璐
侯宇婷
冯毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN202111127502.7A priority Critical patent/CN113870901B/en
Publication of CN113870901A publication Critical patent/CN113870901A/en
Application granted granted Critical
Publication of CN113870901B publication Critical patent/CN113870901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech emotion recognition method based on SVM-KNN comprises the following steps: first, the original speech signal is preprocessed; second, speech enhancement is performed by delay-aligning the channels of a microphone array; third, features are extracted from the processed data with a BN-DNN having an SHL structure; fourth, the extracted features are selected with a method based on fuzzy set theory; and fifth, emotion recognition is performed with an optimized SVM-KNN method. The method yields higher speech emotion classification accuracy, avoids the optimization bottleneck that arises with large-scale training samples, and improves SVM classification accuracy and recognition speed. The SVM-KNN idea proposed by the invention can also be applied to other speech recognition tasks, such as dialect classification, and provides a reference for classification and recognition based on speech signals.

Description

Voice emotion recognition method based on SVM-KNN
Technical Field
The invention relates to voice emotion recognition, in particular to a voice emotion recognition method based on SVM-KNN.
Background
Among current speech emotion recognition methods, the Support Vector Machine (SVM) has proved to be a relatively effective classification tool, but when the degree of emotion confusion is high, the SVM still struggles to recognize emotions accurately.
Emotion has long been studied by experts in physiology and psychology. With the rapid development of artificial intelligence, emotion research in human-computer interaction has attracted strong interest. In human-computer interaction, people hope to communicate with machines more naturally, which requires machines to understand human emotions, so emotion classification and recognition are particularly important. In human communication, speech carries rich information, so machines can use it to classify and recognize emotions. Experts have done extensive research and analysis on speech emotion classification and recognition, including building speech emotion databases, extracting emotion features, and designing classification and recognition methods. To improve the speech emotion recognition rate, previous work has improved individual links of the pipeline, but there is no unified system and the recognition rate remains limited. MFCCs have been used as recognition features, but without further processing before recognition a large amount of redundant information degrades the result. To eliminate this influence and improve the recognition rate, the choice of an appropriate classifier is a key research point; selecting a suitable classification method is essential for improving the emotion recognition rate and handling emotion features correctly.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a speech emotion recognition method based on SVM-KNN, which performs speech enhancement by delay-aligning a microphone array, extracts features with a BN-DNN based on an SHL structure, selects features with a method based on fuzzy set theory, and then performs emotion recognition with an optimized SVM-KNN method, resulting in a speech emotion recognition method with high accuracy and low computational load.
To achieve this purpose, the invention adopts the following technical scheme:
a speech emotion recognition method based on SVM-KNN, different speech signal preprocessing modes, specific feature extraction, fuzzy feature selection and SVM-KNN support vector machine classification, comprises the following steps:
(1) Preprocessing the input speech signal; the preprocessing comprises pre-emphasis filtering and windowed framing, with a pre-emphasis coefficient α of 0.95 and a frame length of 26 ms, as sketched below;
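A minimal sketch of this preprocessing step. The pre-emphasis coefficient 0.95 and the 26 ms frame length follow the text; the Hamming window and the 10 ms frame shift are illustrative assumptions, since neither is specified above.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, alpha=0.95,
               frame_ms=26.0, shift_ms=10.0):
    # Pre-emphasis filter: s'(n) = s(n) - alpha * s(n-1), alpha = 0.95.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(round(sample_rate * frame_ms / 1000.0))    # 26 ms frames
    frame_shift = int(round(sample_rate * shift_ms / 1000.0))  # assumed 10 ms shift
    window = np.hamming(frame_len)                             # assumed Hamming window

    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, frame_shift):
        # Windowed frame: sw(n) = s(n) * w(n)
        frames.append(emphasized[start:start + frame_len] * window)
    return np.array(frames)
```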
(2) Delay-aligning the data of the different microphone channels with a microphone array solution, so as to localize the sound source and improve the audio quality:
1) A nested microphone array structure consisting of 9 microphones, in fact 4 linear sub-arrays each consisting of 5 equally spaced microphones (spacings of 2.5, 5, 10 and 20 cm respectively), ensures that the frequency range of the recorded speech signal covers 300-3400 Hz.
2) At the same time, a room impulse response model is used under the assumption that the sound field is a far field, i.e. the distance from the speaker to the microphone array is large relative to the microphone spacing.
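A minimal sketch of the delay alignment in step (2): each channel's delay relative to a reference microphone is estimated by cross-correlation, and the aligned channels are averaged (delay-and-sum). The cross-correlation estimator and the averaging are assumptions; the text only states that the channels are delay-aligned.

```python
import numpy as np

def align_and_sum(channels):
    """channels: 2-D array of shape (num_mics, num_samples)."""
    reference = channels[0]
    aligned = [reference]
    for ch in channels[1:]:
        # Estimate the integer-sample delay that maximizes the
        # cross-correlation of this channel with the reference channel.
        corr = np.correlate(ch, reference, mode="full")
        delay = corr.argmax() - (len(reference) - 1)
        aligned.append(np.roll(ch, -delay))   # shift the channel back into alignment
    # Delay-and-sum beamforming: average the aligned channels.
    return np.mean(aligned, axis=0)
```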
(3) Performing feature extraction with the BN-DNN based on the SHL structure; the feature extraction process is as follows:
1) In the experiment, the BN-DNN model has 5 hidden layers; the 3rd hidden layer is set as the bottleneck layer and each of the remaining hidden layers has 1024 neurons. The input is the 40-dimensional MFCC features of 11 consecutive frames,
2) so the input layer has 440 (40 x 11) neurons, giving a DNN structure of 440-1024-1024-BN-1024-1024-output, where BN denotes the bottleneck layer.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experiments test group sizes of 64, 128 and 256 and overlap factors α of 0%, 20%, 30% and 40%.
4) The sparsity of the network is measured by the proportion of neurons whose activation probability h equals 0, defined as
$\text{sparsity} = \frac{1}{D}\sum_{i=1}^{D}\mathbb{1}(h_i = 0)$
where D is the number of neurons in a layer and h_i (i = 1, 2, …, D) is the activation probability of the i-th neuron; a larger sparsity means that more neurons in the hidden layer are inactive, i.e. the layer is sparser. For each model, the model is first trained on the training set to obtain the activation probability of each layer's neurons, these probabilities are substituted into the definition to compute that layer's sparsity, and the average sparsity over all hidden layers is taken as the sparsity of the whole neural network; finally, the speech bottleneck features are extracted.
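A minimal sketch of the sparsity measure just defined: the fraction of neurons with zero activation probability in each layer, averaged over all hidden layers. Collecting the per-neuron activation probabilities during training is assumed to happen elsewhere.

```python
import numpy as np

def layer_sparsity(activation_probs):
    """activation_probs: 1-D array of per-neuron activation probabilities h_i."""
    d = len(activation_probs)
    # Proportion of neurons whose activation probability equals 0.
    return np.sum(activation_probs == 0) / d

def network_sparsity(hidden_layer_probs):
    """hidden_layer_probs: list of per-layer activation-probability arrays."""
    # Average the per-layer sparsities to obtain the network-level sparsity.
    return float(np.mean([layer_sparsity(p) for p in hidden_layer_probs]))
```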
(4) Selecting features with a method based on fuzzy set theory:
1) In the feature space R, for a c-class problem, the training sample set is X = {x_1, x_2, …, x_N}, where N is the number of samples; for a sample x to be classified, the value K of its number of nearest neighbours is determined first.
2) The distance between the sample to be classified and every training sample is computed, using the Euclidean distance:
$d(x, x_i) = \sqrt{\sum_{j} (x_j - x_{ij})^2}$
3) The N distances are sorted:
$d_{(1)} \le d_{(2)} \le d_{(3)} \le \dots \le d_{(K)} \le d_{(K+1)} \le \dots \le d_{(N)}$
where $d_{(1)}, \dots, d_{(K)}$ are the distances between the sample to be classified and its K nearest neighbours.
4) The class membership u_i(x) of the sample to be classified is computed according to formula (1), where m is the fuzzy weight adjusting factor and i = 1, 2, …, c. If u_i(x) = max_n{u_n(x)}, x is judged to belong to the i-th class; the algorithm is repeated until all samples to be classified have been processed (a membership sketch is given below).
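Formula (1) is not reproduced in the text, so the sketch below uses the classical fuzzy-KNN membership (inverse-distance weighting with fuzzy exponent m and crisp training labels) as a stand-in; this specific formula is an assumption.

```python
import numpy as np

def fuzzy_knn_predict(x, train_x, train_labels, num_classes, k=5, m=2.0):
    """train_x: (N, d) array; train_labels: integer class indices; m: fuzzy factor."""
    # Euclidean distances from the test sample to all training samples.
    dists = np.linalg.norm(train_x - x, axis=1)
    nn_idx = np.argsort(dists)[:k]                      # the K nearest neighbours
    # Inverse-distance weights with fuzzy exponent m (assumed form of formula (1)).
    weights = 1.0 / (dists[nn_idx] ** (2.0 / (m - 1)) + 1e-12)

    memberships = np.zeros(num_classes)
    for idx, w in zip(nn_idx, weights):
        # Crisp training labels: each neighbour contributes only to its own class.
        memberships[train_labels[idx]] += w
    memberships /= weights.sum()
    # Assign x to the class with the largest membership u_i(x).
    return int(np.argmax(memberships)), memberships
```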
(5) Performing emotion recognition with an optimized SVM-KNN method:
1) Let the membership of each sample to its class be s_i. The fuzzified input sample set is then S = {(x_1, y_1, s_1), (x_2, y_2, s_2), …, (x_l, y_l, s_l)}, where x_i ∈ R, y_i ∈ {1, -1}, σ ≤ s_i ≤ 1, σ is a sufficiently small positive number, and s_i indicates the degree to which the i-th sample belongs to the positive class.
2) Introducing a nonlinear transformation φ: R → F, the samples are mapped from the input space R to a high-dimensional feature space F, and the optimal classification hyperplane is determined in that space using the structural risk minimization principle and the idea of maximizing the classification margin; the problem of finding the optimal FSVM hyperplane can thus be converted into the following optimization problem:
$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C_0 \sum_{i=1}^{l} s_i \xi_i$
$\text{s.t.}\ \ y_i\left(w \cdot \varphi(x_i) + b\right) \ge 1 - \xi_i,$
$\xi_i \ge 0,\ i = 1, \dots, l.$
3) Establishing the Lagrange function
$L(w, b, \xi, \mu, \beta) = \frac{1}{2}\|w\|^2 + C_0 \sum_{i=1}^{l} s_i \xi_i - \sum_{i=1}^{l} \mu_i\left[y_i\left(w \cdot \varphi(x_i) + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{l} \beta_i \xi_i$
where μ_i ≥ 0 and β_i ≥ 0 are the Lagrange multipliers, C_0 > 0 is the penalty factor, and w is the weight vector of the linear classification function y.
4) The following dual programming problem results:
$\max_{\mu}\ \sum_{i=1}^{l} \mu_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \mu_i \mu_j y_i y_j k(x_i, x_j)$
$\text{s.t.}\ \ \sum_{i=1}^{l} y_i \mu_i = 0,$
$0 \le \mu_i \le s_i C_0,\ i = 1, \dots, l. \quad (5)$
where k(x_i, x_j) is the kernel function. Considering the KKT conditions, samples with μ_i = 0 are correctly classified and are not support vectors; samples with 0 < μ_i < s_i C_0 are support vectors lying on the margin boundary, i.e. sample x_i is correctly classified and sits exactly on the boundary of the margin.
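The bound 0 ≤ μ_i ≤ s_i C_0 above means that each training sample's penalty is scaled by its membership s_i. A minimal sketch of this fuzzy weighting, using scikit-learn's per-sample weights as an approximation (the text does not name a specific solver):

```python
import numpy as np
from sklearn.svm import SVC

def train_fuzzy_svm(train_x, train_y, memberships, c0=1.0, kernel="rbf"):
    """memberships: array of s_i in (0, 1], one fuzzy membership per training sample."""
    clf = SVC(C=c0, kernel=kernel)
    # Per-sample weights scale the penalty of each slack variable,
    # mimicking the effective upper bound s_i * C0 of the dual above.
    clf.fit(train_x, train_y, sample_weight=memberships)
    return clf
```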
Drawings
FIG. 1 is a flow chart of speech signal enhancement according to the present invention;
FIG. 2 is a flow chart of the multi-level SVM classifier of the present invention;
FIG. 3 shows the SVM-KNN classification steps of the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, FIG. 2 and FIG. 3, a speech emotion recognition method based on SVM-KNN, comprising speech signal preprocessing, multiple-feature extraction, fuzzy-KNN feature selection and SVM-KNN (support vector machine plus K-nearest-neighbour) classification, includes the following steps:
(1) Preprocessing the original data, which comprises pre-emphasis, framing, windowing and endpoint detection.
1) After windowing, the signal s(n) becomes sw(n), given by: sw(n) = s(n) × w(n).
2) The data is passed through a pre-emphasis filter with pre-emphasis coefficient α = 0.95.
(2) A corresponding microphone array is designed, and the data of the different microphone channels are delay-aligned with a microphone array solution to localize the sound source and improve audio quality.
1) A nested microphone array structure consisting of 9 microphones, in fact 4 linear sub-arrays each consisting of 5 equally spaced microphones (spacings of 2.5, 5, 10 and 20 cm respectively), ensures that the frequency range of the recorded speech signal covers 300-3400 Hz.
2) A room impulse response model is used under the assumption that the sound field is a far field, i.e. the distance from the speaker to the microphone array is large relative to the microphone spacing.
(3) Features are extracted from the processed data with the BN-DNN based on the SHL structure.
1) The BN-DNN model has 5 hidden layers; the 3rd hidden layer is set as the bottleneck layer and each of the remaining hidden layers has 1024 neurons. The 40-dimensional MFCC features of 11 consecutive frames are used as input for extracting the bottleneck feature,
2) so the DNN structure is 440-1024-1024-BN-1024-1024-output, where BN denotes the bottleneck layer.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experiments test group sizes of 64, 128 and 256 and overlap factors α of 0%, 20%, 30% and 40%.
4) The sparsity is measured by the proportion of neurons whose activation probability equals 0; the higher this proportion, the sparser the network. Using the definition given above, the average sparsity over all hidden layers is computed as the sparsity of the whole neural network, and finally the speech bottleneck features are extracted.
(4) The extracted features are selected with a method based on fuzzy set theory:
1) The short-time energy, short-time amplitude, short-time zero-crossing rate and pitch frequency features are extracted with the corresponding analysis functions.
2) The extracted feature parameters are assembled into a feature vector, which serves as the input of the fuzzy set.
3) For the identification of C emotion classes, the mean of each feature parameter over each of the C emotion states is computed from the training sample set X and recorded as M_ij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters). Each feature parameter M_ijn of every speech sample under every emotion state (n indexes the samples in that emotion state, n = 1 being the first utterance, and so on) is then normalized according to the normalization formula.
4) The dispersion of the feature parameter under a given emotion is computed.
5) After the dispersion of every feature parameter under every emotion has been obtained, the contribution degree u_ij of feature parameter θ_i under each emotion is computed from the dispersion.
6) When fuzzy K-nearest-neighbour discrimination is applied to a sample to be classified, the Euclidean distance is weighted by the contribution degrees of the emotion feature parameters.
7) Finally, the features that contribute most to emotion recognition are retained, as sketched below.
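The dispersion and contribution-degree formulas are not reproduced above, so the sketch below assumes that a feature's contribution is inversely proportional to its coefficient of variation within an emotion class and that the contributions weight the squared differences in the Euclidean distance; both choices are labelled assumptions, not the exact formulas of the method.

```python
import numpy as np

def contribution_weights(class_samples):
    """class_samples: (num_samples, num_features) array for one emotion class."""
    mean = class_samples.mean(axis=0)
    std = class_samples.std(axis=0)
    # ASSUMED dispersion measure: coefficient of variation per feature.
    dispersion = std / (np.abs(mean) + 1e-12)
    # ASSUMED contribution: inversely proportional to dispersion, normalized to sum to 1.
    inv = 1.0 / (dispersion + 1e-12)
    return inv / inv.sum()

def weighted_distance(x, y, weights):
    # Euclidean distance weighted by the per-feature contribution degrees.
    return np.sqrt(np.sum(weights * (x - y) ** 2))
```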
(5) Emotion recognition is performed with the optimized SVM-KNN method:
1) The speech emotion recognition model is constructed according to a multi-level classification strategy.
2) The emotion confusion degree measures how similar two different emotions are.
The confusion degree between the i-th emotion B_i and the j-th emotion B_j is denoted I_ij; it is the average of the probability of misjudging the i-th emotion as the j-th emotion and the probability of misjudging the j-th emotion as the i-th emotion:
$I_{ij} = \frac{1}{2}\left[P(t = B_j \mid x \in B_i) + P(t = B_i \mid x \in B_j)\right] \quad (1.1)$
where x is the test data and t is the recognition result for x.
3) The construction algorithm of the multilevel classification comprises the following specific steps:
a. A speech emotion recognition confusion matrix is computed with the traditional Support Vector Machine (SVM) method;
b. The first-level classifier is constructed: a probability threshold P1 is set, and emotions whose confusion degree exceeds P1 are grouped together, i.e. if I_ab > P1 and I_cd > P1, then a and b form one group and c and d form another; if I_ab > P1 and I_bc > P1, then a, b and c are grouped together.
Once the upper-level classifier has been constructed, a probability threshold P2 is set for the second-level classifier; if I_ab > P2 and I_bc > P2, then a, b and c are likewise grouped together. In this work the first-level threshold P1 is set to 10%, and each level's threshold is incremented by 2% over that of the level above it, so the thresholds of successive levels are 10%, 12%, 14%, 16%, and so on;
c. The emotion confusion degree of the emotion states that have not yet been grouped is computed according to formula (1.1); the procedure returns to step b and assigns them either to an existing group or to a group of their own;
d. The procedure ends when all emotions have been correctly grouped (a grouping sketch based on the confusion matrix follows this list).
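As a concrete illustration of steps a to d, the sketch below computes the pairwise confusion degree I_ij from a confusion matrix and merges any emotions whose confusion exceeds the current level's threshold (10% at the first level, incremented by 2% per level). The union-find merge is an illustrative implementation choice, not something the text prescribes.

```python
import numpy as np

def confusion_degree(conf_matrix):
    """conf_matrix[i, j]: probability of predicting emotion j given true emotion i."""
    # I_ij = average of the two misclassification probabilities (formula (1.1)).
    return 0.5 * (conf_matrix + conf_matrix.T)

def group_emotions(conf_matrix, threshold=0.10):
    n = conf_matrix.shape[0]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    degree = confusion_degree(conf_matrix)
    for i in range(n):
        for j in range(i + 1, n):
            if degree[i, j] > threshold:
                parent[find(i)] = find(j)   # merge the two confusable emotions

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Example: at the second level the threshold would be raised to 0.12, and so on.
```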
Examples
Step 1: the method for preprocessing the raw data comprises the following steps:
(1) This embodiment uses the EMO-DB data set, a German emotional speech corpus recorded at the Technical University of Berlin. Ten actors (5 male and 5 female) each produced 10 sentences (5 long and 5 short) in 7 emotions (neutral, anger, fear, joy, sadness, disgust, boredom), giving about 800 utterances in total, sampled at 48 kHz (later down-sampled to 16 kHz) with 16-bit quantization. The corpus texts were chosen to be semantically neutral and without emotional tendency, in an everyday spoken style free of excessive written-language polish. The recordings were made in a professional studio, and before performing a given emotion the actors were asked to incubate it by recalling real experiences of their own, so as to enhance its authenticity. A listening test with 20 participants (10 male and 10 female) gave an average recognition rate of 84.3%.
After the listening test, 233 male and 302 female emotional utterances were retained, 535 in total. The sentence content comprises 5 short and 5 long everyday sentences, with a high degree of emotional freedom and no particular emotional tendency. The audio is sampled at 16 kHz, quantized to 16 bits and stored in WAV format.
(2) Preprocessing: the pre-emphasis coefficient is set and pre-emphasis is applied to the speech signal, then the frame length is set and windowed framing is applied.
Step 2: the relevant channels of the preprocessed data are selected and the data of the different microphone channels are delay-aligned with the microphone array solution, so as to localize the sound source and improve audio quality.
Step 3: features are extracted with the BN-DNN based on the SHL structure: the number of layers and the parameters of each layer are set, the BN-DNN network is constructed and trained with the MFCC input, and the speech bottleneck features are output.
Step 4: fuzzy feature selection is performed on the extracted features: the class membership of each sample is computed according to the membership formula and sample x is judged to belong to the i-th class, until all samples have been processed.
Step 5: emotion recognition is performed with the optimized SVM-KNN method: the emotion confusion degrees are computed and the emotions are assigned to the corresponding groups.
Step 6: the accuracy is counted to obtain the final result.

Claims (1)

1. A speech emotion recognition method based on SVM-KNN is characterized by comprising the following steps:
(1) preprocessing the original data using pre-emphasis, framing, windowing and endpoint detection;
1) boosting the high-frequency part with pre-emphasis so that the signal spectrum becomes flatter, which facilitates spectrum analysis and vocal tract parameter analysis;
2) framing the speech signal; to make the transition between frames smooth and preserve continuity, overlapping segmentation is used, taking one segment per frame shift so that as many frames as possible are obtained for short-time analysis;
3) multiplying s(n) by a window function w(n) to form the windowed speech signal sw(n) = s(n)·w(n);
4) accurately locating the start and end points of the speech within a recording, so as to separate the valid speech signal from useless noise;
(2) designing a corresponding microphone array, and performing delay alignment on data of different microphone channels by using a microphone array solution to realize sound source positioning and improve audio quality;
1) estimating a noise power spectrum of an input speech signal by using a first-order recursive smoothing method;
2) calculating the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the voice signal with noise;
3) smoothing the noisy speech signal to obtain its smoothed power spectrum S(λ,k), and performing a minimum search on the smoothed output to obtain S_min(λ,k);
4) computing the speech presence indicator I(λ,k), performing a second round of smoothing and minimum search based on it, and obtaining the speech presence probability q(λ,k);
5) calculating the speech signal presence probability from q(λ,k) and the prior and posterior signal-to-noise ratios according to the corresponding formula;
6) Updating time-varying smoothing parameters and a smoothing noise power spectrum;
(3) extracting the characteristics of the processed data based on BN-DNN of an SHL structure;
1) firstly, 39-dimensional MFCC features (13 static coefficients plus their delta and delta-delta) are extracted from 1 hour of the Vystadial_cz corpus, a triphone GMM model is trained, and forced alignment is performed;
2) a triphone GMM acoustic model based on Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) is trained, in which 13-dimensional MFCC features are spliced over 9 frames and reduced to 40 dimensions by LDA; the model has 19200 Gaussian mixture components;
3) then, carrying out Speaker Adaptive Training (SAT) by using a Feature-space maximum likelihood linear regression (fMLLR) technology, thereby forming a GMM acoustic model of LDA + MLLT + SAT;
4) the training targets of the softmax layer in the BN-DNN are obtained by forced alignment with this model; the DNN is trained on filterbank (fbank) features, which work well here: 40-dimensional fbank features are extracted and spliced over 11 frames (5-1-5), and the resulting supervector is used as the DNN input feature;
5) 10 rounds of RBM pre-training are performed on each hidden layer (including the BN layer), the global parameters are then fine-tuned with the BP algorithm, and finally the three major classes of features, prosody, voice quality and spectrum, are extracted;
(4) selecting the extracted features based on a fuzzy set theory method;
1) analyzing the short-time energy, short-time amplitude and short-time zero-crossing rate with the function TimePara(), and extracting the pitch frequency with the function FunFre();
2) after short-time energy, short-time amplitude, short-time zero-crossing rate and fundamental tone frequency are respectively extracted, the extracted characteristic parameters are combined into a characteristic vector to be used as the input of a fuzzy set.
3) for the identification of C emotion classes, the mean of each feature parameter over each of the C emotion states is computed from the training sample set X and recorded as M_ij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); each feature parameter M_ijn of every speech sample under every emotion state (n indexes the samples in that emotion state, n = 1 being the first utterance, and so on) is then normalized according to the normalization formula;
4) the dispersion of the feature parameter under a given emotion is then computed;
5) after the dispersion of every feature parameter under every emotion has been obtained, the contribution degree u_ij of feature parameter θ_i under each emotion is computed from the dispersion;
when fuzzy K-nearest-neighbour discrimination is applied to the sample to be classified, the Euclidean distance is weighted by the contribution degrees of the emotion feature parameters;
finally, extracting features which have the largest contribution to emotion recognition;
(5) performing emotion recognition on the voice features by adopting an optimized SVM-KNN method based on the extracted features;
1) decomposing the 6-emotion classification problem and building a multi-level SVM classifier based on a decision tree: the SVM at each level recognizes one emotion from the sample set, the remaining samples enter the next-level SVM for recognition, the sample set shrinking level by level as shown in FIG. 2, and the leaf nodes of the decision tree finally yield the recognized emotions;
2) for the misclassified samples that arise near the SVM hyperplane, the KNN algorithm is combined with the SVM, and an SVM-KNN combined classification model is constructed to improve SVM accuracy; the SVM-KNN classification steps are:
firstly, all samples in the training set are initially assumed to be labelled; a small number of samples are randomly selected from the training set to build a small initial training set, ensuring that each emotion is represented by at least one sample;
secondly, a weak SVM classifier for emotion A is obtained from the initial training samples, and its optimal classification hyperplane, support vector set T, classification decision function coefficients W and constant b are determined;
thirdly, a sample is selected from emotion class A, its similarity to all non-A samples is computed, and the n samples most likely not to belong to class A are recorded as sample set A; a sample is selected from the non-A samples, its similarity to all class-A samples is computed, and the n samples most likely to belong to class A are recorded as sample set B;
fourthly, the samples in A and B are points near the hyperplane; each sample x in A and B is substituted into the decision function
g(x) = Σ_i y_i a_i K(x_i, x) + b (6);
fifthly, if |g(x)| > e, the SVM classifies the sample point with high accuracy and reliability, so its class can be determined by the decision function f(x) = sgn(g(x));
sixthly, if |g(x)| < e, the sample point lies near the hyperplane, its classification reliability is low and it is easily misclassified; therefore the KNN method [4] is used to determine the class of sample x: taking the support vector sets T of class A and non-class-A as training samples, the distance d(x, x_i) between sample x and each vector in T is computed, and the class of the vector closest to sample x is taken as the class of x,
$d(x, x_i) = \sqrt{K(x, x) - 2K(x, x_i) + K(x_i, x_i)}$
in the formula, x_i is a support vector and K(·) is a first-order polynomial kernel; the threshold e lies in the range [0,1], its specific value can be adjusted dynamically according to the experimental results, the initial value is generally set to 1, and if it is adjusted to 0 the algorithm reduces to the traditional SVM algorithm;
seventhly, the samples classified by the SVM and the samples classified by KNN are added to the initial training set to expand it, and a new SVM2 is trained on the expanded training set;
eighthly, this is iterated until all samples in the training set have been added to the initial training set, at which point the iteration stops; the final training set yields an SVM classifier with high classification accuracy for emotion class A;
ninthly, this completes the training of the first-level SVM classifier of the decision tree; the non-A sample set is then used to train the next-level SVM, and training proceeds level by level until an SVM classifier corresponding to each emotion class is obtained.
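An illustrative sketch of the hybrid decision rule described above: a sample far from the hyperplane (|g(x)| > e) takes the SVM decision, while a sample inside the margin band falls back to a nearest-neighbour vote over the support vectors. Scikit-learn's decision_function stands in for g(x), and the plain Euclidean distance is used in place of the kernel-space distance; both are simplifying assumptions.

```python
import numpy as np

def svm_knn_decide(x, svm_model, support_vectors, support_labels, e=0.5):
    """svm_model: a trained binary SVM exposing decision_function (e.g. sklearn SVC)."""
    g = svm_model.decision_function(x.reshape(1, -1))[0]
    if abs(g) > e:
        # High-confidence region: use the SVM decision f(x) = sgn(g(x)).
        return 1 if g > 0 else -1
    # Low-confidence region near the hyperplane: the closest support vector decides.
    # Euclidean distance is a simplification of the kernel-space distance d(x, x_i).
    dists = np.linalg.norm(support_vectors - x, axis=1)
    return int(support_labels[np.argmin(dists)])
```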
CN202111127502.7A 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method Active CN113870901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111127502.7A CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111127502.7A CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Publications (2)

Publication Number Publication Date
CN113870901A true CN113870901A (en) 2021-12-31
CN113870901B CN113870901B (en) 2024-05-24

Family

ID=78994361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111127502.7A Active CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN113870901B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492384A (en) * 2017-07-14 2017-12-19 北京联合大学 A kind of speech-emotion recognition method based on fuzzy nearest neighbor algorithm
CN108899046A (en) * 2018-07-12 2018-11-27 东北大学 A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
CN109036468A (en) * 2018-11-06 2018-12-18 渤海大学 Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core
KR20190102667A (en) * 2018-02-27 2019-09-04 광주과학기술원 Emotion recognition system and method thereof
CN111832438A (en) * 2020-06-27 2020-10-27 西安电子科技大学 Electroencephalogram signal channel selection method and system for emotion recognition and application

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492384A (en) * 2017-07-14 2017-12-19 北京联合大学 A kind of speech-emotion recognition method based on fuzzy nearest neighbor algorithm
KR20190102667A (en) * 2018-02-27 2019-09-04 광주과학기술원 Emotion recognition system and method thereof
CN108899046A (en) * 2018-07-12 2018-11-27 东北大学 A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
CN109036468A (en) * 2018-11-06 2018-12-18 渤海大学 Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core
CN111832438A (en) * 2020-06-27 2020-10-27 西安电子科技大学 Electroencephalogram signal channel selection method and system for emotion recognition and application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王光艳; 张培玟; 于宝芸: "Intelligent recognition of Chinese speech emotion information based on an SVM multi-classification algorithm", Electronic Components and Information Technology, no. 07, 20 July 2020 (2020-07-20) *

Also Published As

Publication number Publication date
CN113870901B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
Shahin et al. Emotion recognition using hybrid Gaussian mixture model and deep neural network
Basu et al. A review on emotion recognition using speech
Schuller et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture
Mannepalli et al. Emotion recognition in speech signals using optimization based multi-SVNN classifier
Deshwal et al. A language identification system using hybrid features and back-propagation neural network
Yeh et al. Segment-based emotion recognition from continuous Mandarin Chinese speech
CN110827857B (en) Speech emotion recognition method based on spectral features and ELM
Semwal et al. Automatic speech emotion detection system using multi-domain acoustic feature selection and classification models
Palo et al. Efficient feature combination techniques for emotional speech classification
Nawas et al. Speaker recognition using random forest
Rabiee et al. Persian accents identification using an adaptive neural network
Kawade et al. Speech Emotion Recognition Using 1D CNN-LSTM Network on Indo-Aryan Database
Alrehaili et al. Arabic speech dialect classification using deep learning
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
Chaudhary et al. Feature selection and classification of indian musical string instruments using svm
Aggarwal et al. Application of genetically optimized neural networks for hindi speech recognition system
CN113870901B (en) SVM-KNN-based voice emotion recognition method
Mathur et al. A study of machine learning algorithms in speech recognition and language identification system
Hasan Bird Species Classification And Acoustic Features Selection Based on Distributed Neural Network with Two Stage Windowing of Short-Term Features
Mangalam et al. Emotion Recognition from Mizo Speech: A Signal Processing Approach
Gade et al. Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
Ashrafidoost et al. Recognizing Emotional State Changes Using Speech Processing
Lakra et al. Automated pitch-based gender recognition using an adaptive neuro-fuzzy inference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant