CN106504772B - Speech-emotion recognition method based on weights of importance support vector machine classifier - Google Patents
Speech-emotion recognition method based on weights of importance support vector machine classifier
- Publication number
- CN106504772B CN106504772B CN201610969948.7A CN201610969948A CN106504772B CN 106504772 B CN106504772 B CN 106504772B CN 201610969948 A CN201610969948 A CN 201610969948A CN 106504772 B CN106504772 B CN 106504772B
- Authority
- CN
- China
- Prior art keywords
- frame
- sample
- weights
- importance
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 28
- 238000012706 support-vector machine Methods 0.000 title claims abstract description 9
- 238000012360 testing method Methods 0.000 claims abstract description 53
- 238000012549 training Methods 0.000 claims abstract description 48
- 230000002996 emotional effect Effects 0.000 claims abstract description 9
- 239000013598 vector Substances 0.000 claims description 48
- 230000006870 function Effects 0.000 claims description 23
- 238000002790 cross-validation Methods 0.000 claims description 10
- 238000000605 extraction Methods 0.000 claims description 7
- 238000005457 optimization Methods 0.000 claims description 7
- 230000004069 differentiation Effects 0.000 claims description 6
- 238000009432 framing Methods 0.000 claims description 6
- 239000011159 matrix material Substances 0.000 claims description 6
- 238000007781 pre-processing Methods 0.000 claims description 5
- 230000008569 process Effects 0.000 claims description 4
- 230000001174 ascending effect Effects 0.000 claims description 3
- 239000000284 extract Substances 0.000 claims description 3
- 230000011218 segmentation Effects 0.000 claims description 3
- 230000005236 sound signal Effects 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 claims description 2
- 239000000203 mixture Substances 0.000 claims description 2
- 230000008909 emotion recognition Effects 0.000 abstract description 13
- 238000013139 quantization Methods 0.000 abstract description 2
- 239000000523 sample Substances 0.000 description 103
- 230000008451 emotion Effects 0.000 description 5
- 230000003993 interaction Effects 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 230000018109 developmental process Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 230000002452 interceptive effect Effects 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000875 corresponding effect Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Artificial Intelligence (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a speech emotion recognition method based on an importance-weighted support vector machine (SVM) classifier, comprising the quantification of the deviation between training samples and test samples, the construction of an importance-weight coefficient model, and the construction of an SVM based on the importance-weight coefficients. The method quantifies the deviation between training and test samples on the basis of the importance-weight coefficients, so that the deviation can be corrected at the classifier level. By constructing an importance-weight coefficient model for training and test samples whose distributions differ in emotion classification, the invention quantifies the deviation between the training and test speech samples; the SVM classifier built on the importance-weight coefficient model then adjusts the separating hyperplane at the classifier level to compensate for this deviation, thereby improving the accuracy and stability of speech emotion recognition.
Description
Technical field
The present invention relates to a speech emotion recognition method based on an importance-weighted support vector machine classifier, and belongs to the technical field of speech emotion recognition.
Background technique
With the rapid development of information technology and the rise of various intelligent terminals, existing human-computer interaction systems face increasingly severe challenges. To overcome the obstacles of human-computer interaction and make it more convenient and natural, the emotional intelligence of machines is receiving growing attention from researchers in many fields. Speech, as an efficient and promising interaction medium in human-computer interaction, carries rich emotional information. As an important research topic of affective intelligence, speech emotion recognition has broad application prospects in distance education, assisted lie detection, automatic remote call centers, clinical medicine, intelligent toys, smartphones and the like, and has attracted extensive attention from a growing number of research institutions and researchers.
In practical speech emotion recognition, the acquisition time and acquisition environment of the training samples differ from those of the test samples, so a covariate shift exists between the training and test samples. To improve the precision and robustness of speech emotion recognition, compensating for this deviation is essential. Eliminating the deviation introduced by the recording environment, removing redundancy such as emotion-irrelevant speech-content information from the raw speech data, and extracting effective emotional information are the key points and difficulties in improving the robustness of a speech emotion recognition system.
As an emerging speech-processing technique, the importance-weight coefficient model has received growing attention from researchers because of its flexibility and effectiveness in speech signal processing. For the classification problem, the deviation between training and test samples is quantified on the basis of the importance-weight coefficients, so that the deviation can be corrected at the classifier level; this reduces the influence of environmental factors on speech emotion recognition and improves its accuracy and stability. Such classifier-level compensation of the covariate shift between training and test samples is of great significance for speech emotion recognition research.
Summary of the invention
Technical problem: the present invention provides a speech emotion recognition method, based on an importance-weighted support vector machine classifier, that improves the robustness of speech emotion recognition by compensating at the classifier level for the covariate shift between training and test samples. The method reduces the influence of recognition-irrelevant information such as the recording environment and the speaker, and thereby improves the precision and robustness of speech emotion recognition.
Technical solution: the speech emotion recognition method of the invention based on the importance-weighted support vector machine classifier comprises the following steps:
Step 1: pre-process the input speech signal and extract the feature vectors d_i;
Step 2: divide the input sample set into a training sample set {s_i^tr}_{i=1}^{n_tr} and a test sample set {s_j^te}_{j=1}^{n_te}, and randomly select b template points c_l from the test sample set to form {c_l}_{l=1}^{b}, where s_i^tr is a sample of the training sample set, s_j^te is a sample of the test sample set, n_tr is the number of samples in the training set, n_te is the number of samples in the test set, i is the sample index in the training set, j is the sample index in the test set, and l is the index within the randomly selected template set;
Step 3: compute the optimal Gaussian kernel width σ̂ of the basis functions; the detailed procedure is as follows:
Step 3.1: set the candidate basis-function Gaussian kernel widths to σ = 0.1, 0.2, ..., 1;
Step 3.2: compute the pre-estimated parameter vector α according to the following procedure:
Step 3.2.1: compute Ĥ_{l,l'} according to the following formula and build the b × b matrix Ĥ with Ĥ_{l,l'} as its elements:
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right)$$
where Ĥ is a b × b matrix, Ĥ_{l,l'} is an element of Ĥ, l, l' = 1, 2, ..., b, c_{l'} is a point of the randomly selected test-sample template set {c_l}_{l=1}^{b}, and l' is an index within the randomly selected template set;
Step 3.2.2: compute ĥ_l according to the following formula and build the b-dimensional vector ĥ with ĥ_l as its elements:
$$\hat{h}_{l} = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right)$$
where ĥ is a b-dimensional vector and ĥ_l is an element of ĥ;
Step 3.2.3: compute the pre-estimated parameter vector α:
with α ≥ 0 as the constraint, solve the optimization problem min_α Ĵ(α), i.e. find the value of the parameter vector α that minimizes
$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha$$
where Ĵ(α) is the approximated expected squared error of the importance weights, α' is the transpose of the vector α, and ĥ' is the transpose of the vector ĥ;
Step 3.3: compute the optimal Gaussian kernel width σ̂ of the basis functions by cross-validation:
Divide the training sample set {s_i^tr}_{i=1}^{n_tr} and the test sample set {s_j^te}_{j=1}^{n_te} into R subsets {S_tr^r}_{r=1}^{R} and {S_te^r}_{r=1}^{R} respectively, and compute the approximated expected squared error of the importance weights on the r-th subset under cross-validation:
$$\hat{J}^{(r)} = \frac{1}{2\,|S_{tr}^{r}|}\sum_{s^{tr}\in S_{tr}^{r}} \hat{\beta}(s^{tr})^2 \;-\; \frac{1}{|S_{te}^{r}|}\sum_{s^{te}\in S_{te}^{r}} \hat{\beta}(s^{te})$$
where Ĵ^(r) is the approximated expected squared error of the importance weights on the r-th subset under cross-validation, r = 1, 2, ..., R, S_tr^r is the r-th training-sample subset, S_te^r is the r-th test-sample subset, |S_tr^r| and |S_te^r| are their numbers of samples, s^tr is a sample of S_tr^r, s^te is a sample of S_te^r, and β̂(s^tr) and β̂(s^te) are the importance-weight estimates of the samples s^tr and s^te, computed as
$$\hat{\beta}(s) = \sum_{l=1}^{b} \alpha_l \exp\left(-\frac{\|s-c_l\|^2}{2\sigma^2}\right)$$
where α_l is the l-th element of the pre-estimated parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset values σ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to obtain the cross-validated approximated expected squared error of the importance weights
$$\hat{J} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}^{(r)}, \qquad r = 1, 2, \ldots, R,$$
and take the σ value giving the smallest Ĵ as the optimal basis-function Gaussian kernel width σ̂;
Step 4: with α ≥ 0 as the constraint, solve the optimization problem min_α Ĵ(α) = (1/2)α'Ĥα − ĥ'α to obtain the optimal parameter vector α̂, where
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_{l} = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right), \qquad l, l' = 1, 2, \ldots, b,$$
Ĥ_{l,l'} is the element of the matrix Ĥ in row l and column l', and ĥ_l is the l-th element of the one-dimensional vector ĥ;
Step 5: compute the importance weight β(s) by the following formula:
$$\beta(s) = \sum_{l=1}^{b} \hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right)$$
where α̂_l is the l-th element of the optimal parameter vector α̂, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points;
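For illustration, a minimal Python sketch of the importance-weight estimation of Steps 3–5 is given below. The non-negative fit is carried out here with scipy's bounded L-BFGS-B solver, which is an implementation choice rather than something prescribed by the method, and the variable names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_basis(X, centers, sigma):
    # Phi[i, l] = exp(-||X[i] - centers[l]||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_importance_weights(X_tr, X_te, centers, sigma):
    """Steps 3.2-5 sketch: least-squares fit of beta(s) as a non-negative combination
    of Gaussian basis functions centered on the template points."""
    Phi_tr = gaussian_basis(X_tr, centers, sigma)        # n_tr x b
    Phi_te = gaussian_basis(X_te, centers, sigma)        # n_te x b
    H_hat = Phi_tr.T @ Phi_tr / X_tr.shape[0]            # b x b matrix H-hat
    h_hat = Phi_te.mean(axis=0)                          # b-dimensional vector h-hat
    J = lambda a: 0.5 * a @ H_hat @ a - h_hat @ a        # approximated expected squared error
    grad = lambda a: H_hat @ a - h_hat
    b = centers.shape[0]
    res = minimize(J, x0=np.ones(b), jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, None)] * b)             # constraint alpha >= 0
    alpha_hat = res.x
    return lambda S: gaussian_basis(S, centers, sigma) @ alpha_hat   # beta(s)
```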
Step 6: establish the importance-weighted SVM classifier:
Using the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier, the following SVM classifier expression is obtained:
$$\min_{w,\,b,\,\xi} \;\; \frac{1}{2}|w|^2 + C\sum_{i=1}^{L} \beta_i \xi_i$$
This SVM classifier expression together with the following constraints constitutes the importance-weighted SVM classifier:
$$y_i(\langle w, d_i\rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad 1 \leq i \leq L$$
where w is the normal vector of the separating hyperplane, |w| is the norm of w, C is the penalty parameter, d_i is a feature vector extracted from the pre-processed training sample set {s_i^tr}_{i=1}^{n_tr}, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), ..., (d_L, y_L), β_i is the importance weight of the training sample point (d_i, y_i), and ξ_i is the slack variable of the training sample point (d_i, y_i);
Step 7: use the feature vectors extracted in Step 1 and the importance-weighted SVM classifier established in Step 6 to recognize the speech emotion.
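Because the weights β_i multiply the slack variables in the penalty term, they act as per-sample penalty factors. As one possible realization of Step 6 on top of an off-the-shelf SVM, scikit-learn's sample_weight argument (which rescales C per sample) can play the role of the importance weights; the sketch below assumes the training feature matrix D_tr, the labels y_tr and the weight vector beta_tr are already available:

```python
from sklearn.svm import SVC

def train_importance_weighted_svm(D_tr, y_tr, beta_tr, C=1.0):
    """Step 6 sketch: linear SVM whose slack penalty for sample i is scaled by beta_i."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(D_tr, y_tr, sample_weight=beta_tr)   # per-sample penalty C * beta_i
    return clf

# Step 7: emotion labels for unseen utterances are then obtained with clf.predict(D_te)
```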
Further, in the method of the present invention, the pre-processing in Step 1 comprises the following steps:
Step 1.1: apply pre-emphasis to the digital speech signal X according to the pre-emphasis formula to obtain the pre-emphasised speech signal X̃,
where ñ denotes the discrete-point index of the digital speech signal X, Ñ is the length of the digital speech signal X, X(ñ) and X(ñ−1) denote the values of X at the ñ-th and (ñ−1)-th discrete points respectively, X̃(ñ) denotes the value of the pre-emphasised speech signal X̃ at the ñ-th discrete point, and X(−1) = 0;
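A short sketch of Step 1.1 follows; the pre-emphasis coefficient 0.97 is an assumed typical value, not a value taken from the patent:

```python
import numpy as np

def preemphasis(x, mu=0.97):
    """Step 1.1 sketch: first-order pre-emphasis X~(n) = X(n) - mu * X(n-1), with X(-1) = 0.
    The coefficient mu = 0.97 is an assumed typical value."""
    y = np.asarray(x, dtype=float)
    out = np.empty_like(y)
    out[0] = y[0]                       # X(-1) = 0
    out[1:] = y[1:] - mu * y[:-1]
    return out
```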
Step 1.2: frame the pre-emphasised speech signal X̃ by overlapping segmentation. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate F_s = 16 kHz, and each frame is 16 ms long, i.e. 256 points. Framing X̃ yields the speech-frame set {x̃_k'}_{1≤k'≤K'}, in which the n-th discrete point of the k'-th speech frame is taken from X̃ according to the 128-point frame shift,
where x̃_k' is the k'-th speech frame of the speech-frame set, n denotes the discrete-point index within a speech frame, k' is the speech-frame index, and K' is the total number of speech frames, determined by the signal length, the 256-point frame length and the 128-point frame shift, with the fraction rounded down;
Step 1.3: apply windowing to each speech frame x̃_k' with a 256-point Hamming window w, obtaining the windowed speech frame x_k':
$$x_{k'}(n) = \tilde{x}_{k'}(n)\, w(n)$$
where x_k'(n), x̃_k'(n) and w(n) respectively denote the values of x_k', x̃_k' and w at the n-th discrete point, and the 256-point Hamming window function is
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{255}\right), \qquad 0 \leq n \leq 255;$$
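A sketch of Steps 1.2–1.3 under the stated settings (16 kHz sampling, 256-point frames, 128-point shift); the frame start-index convention is an assumption:

```python
import numpy as np

def frame_and_window(x, frame_len=256, frame_shift=128):
    """Steps 1.2-1.3 sketch: overlapping segmentation into 16 ms frames with an 8 ms shift,
    each frame multiplied by a 256-point Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))."""
    n_frames = max(0, 1 + (len(x) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.empty((n_frames, frame_len))
    for k in range(n_frames):
        start = k * frame_shift
        frames[k] = x[start:start + frame_len] * window
    return frames
```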
Step 1.4: for each windowed speech frame x_k', 1 ≤ k' ≤ K', compute the short-time energy E_k' and the short-time zero-crossing rate Z_k':
$$E_{k'} = \sum_{n=0}^{255} x_{k'}(n)^2, \qquad Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \big|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\big|$$
where E_k' denotes the short-time energy of the windowed speech frame x_k', Z_k' denotes the short-time zero-crossing rate of x_k', x_k'(n) is the value of x_k' at the n-th sampling point, x_k'(n−1) is the value of x_k' at the (n−1)-th sampling point, and sgn[x_k'(n)] and sgn[x_k'(n−1)] are the sign functions of x_k'(n) and x_k'(n−1) respectively, that is:
$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \geq 0 \\ -1, & \lambda < 0 \end{cases}$$
where λ is the argument of the above sign function;
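A sketch of Step 1.4 over the windowed frames produced above:

```python
import numpy as np

def short_time_energy_zcr(frames):
    """Step 1.4 sketch: E_k = sum_n x_k(n)^2 and
    Z_k = 0.5 * sum_n |sgn(x_k(n)) - sgn(x_k(n-1))|, with sgn(v) = +1 if v >= 0 else -1."""
    energy = np.sum(frames ** 2, axis=1)
    signs = np.where(frames >= 0, 1, -1)
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    return energy, zcr
```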
Step 1.5: determine the short-time energy threshold t_E and the short-time zero-crossing-rate threshold t_Z from the statistics of the speech frames, where K' is the total number of speech frames;
Step 1.6: for each windowed speech frame, first perform a first-stage discrimination using the short-time energy: windowed speech frames whose short-time energy exceeds the threshold t_E are marked as first-stage valid speech frames; the first-stage valid speech frame with the smallest frame index is taken as the start frame of the current valid speech-frame set, and the one with the largest frame index is taken as the end frame of the current valid speech-frame set;
Then perform a second-stage discrimination using the short-time zero-crossing rate: for the current valid speech-frame set, examine the frames one by one starting from the start frame in decreasing order of frame index, marking windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as valid speech frames, and likewise examine the frames one by one starting from the end frame in increasing order of frame index, marking windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as valid speech frames;
The valid speech-frame set obtained after the two-stage discrimination is denoted {p_k}_{1≤k≤K}, where k is the valid speech-frame index, K is the total number of valid speech frames, and p_k is the k-th valid speech frame of the valid speech-frame set.
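One possible reading of the two-stage discrimination of Step 1.6 is sketched below; how the thresholds t_E and t_Z are derived from the frame statistics is not fixed here, so they are passed in as arguments:

```python
import numpy as np

def endpoint_detection(energy, zcr, t_E, t_Z):
    """Step 1.6 sketch: first-stage marking by short-time energy, then extending outwards
    from the first/last first-stage frame using the short-time zero-crossing rate."""
    first_stage = np.where(energy > t_E)[0]
    valid = np.zeros(len(energy), dtype=bool)
    if first_stage.size == 0:
        return valid
    valid[first_stage] = True
    start, end = first_stage.min(), first_stage.max()
    for k in range(start - 1, -1, -1):          # scan backwards from the start frame
        if zcr[k] > t_Z:
            valid[k] = True
    for k in range(end + 1, len(energy)):       # scan forwards from the end frame
        if zcr[k] > t_Z:
            valid[k] = True
    return valid
```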
Further, in the method of the present invention, the feature vector d_i in Step 1 is extracted as follows:
the short-time features at the speech-frame level, together with their first-order and second-order differences, are used as low-level descriptors, and the statement-level features are obtained by computing statistics of the low-level descriptors over the utterance.
The statistical features of an utterance sample take the frame-level short-time features (such as fundamental frequency, frame energy, mel-frequency cepstral coefficients and the wavelet-packet cepstral coefficient feature described herein) as low-level descriptors (Low Level Descriptors, LLDs); the statement-level feature parameters are obtained by computing statistics of all the short-time features over the utterance.
Common statistics used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0–250 Hz, 0–650 Hz, 250–650 Hz, 1–4 kHz), the cepstral energies of 26 mel-frequency bands, 13th-order mel-frequency cepstral coefficients, the positions of the mel-spectrum maximum and minimum, and the 90%, 75%, 50% and 25% mel-spectrum roll-off points.
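As an illustration of how statement-level features are assembled from the frame-level contours, the sketch below applies a small, assumed set of statistics (the full set of Table 1 is not reproduced) to one LLD contour and to its first- and second-order differences:

```python
import numpy as np

def statement_level_features(lld):
    """Utterance-level statistics of one frame-level LLD contour and of its first- and
    second-order differences; the statistic set (mean, std, min, max, range) is assumed."""
    def stats(v):
        return [v.mean(), v.std(), v.min(), v.max(), v.max() - v.min()]
    return np.array(stats(lld) + stats(np.diff(lld)) + stats(np.diff(lld, n=2)))
```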
Beneficial effects: compared with the prior art, the present invention has the following advantages:
Existing speech emotion recognition methods do not account for the covariate shift that exists between training and test samples in practical applications, so the performance of speech emotion recognition in real use is worse than that obtained under laboratory conditions. The present invention establishes an importance-weight coefficient model that explicitly considers the differences between the test and training samples in practical applications, i.e. it quantifies the covariate shift between the training and test samples; the computed importance-weight coefficient β is this quantified value and directly reflects the deviation between training and test samples. In the subsequent speech emotion feature extraction and classifier construction, the covariate-shift quantity β can be used to compensate for the deviation, thereby largely removing the influence of the recording environment on speech emotion recognition. Compared with other deviation-compensation methods for speech emotion recognition, establishing the importance-weight coefficient model to quantify the deviation between training and test samples reduces the computational complexity and difficulty of covariate-shift compensation.
On the basis of the importance-weight coefficient model, the deviation between training and test samples is compensated in the SVM classifier by introducing the importance-weight coefficients. Compared with other SVM-based recognition methods, this method introduces the importance weights into the objective function of the classical SVM classifier, which is equivalent to using a non-fixed penalty factor: according to the importance-weight coefficients, data with large weights receive a larger penalty coefficient, so that the separating hyperplane is adjusted, the influence of environmental factors on speech emotion recognition is reduced, and the accuracy and stability of speech emotion recognition in practical applications are improved. It therefore achieves better classification performance than the standard SVM.
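Concretely, the stated equivalence can be seen by writing the weighted objective as
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}|w|^2 + \sum_{i=1}^{L} (C\beta_i)\,\xi_i, \qquad y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i,\ \ \xi_i \ge 0,$$
so the importance weight acts as a per-sample penalty factor C_i = Cβ_i: training points with large β_i are penalized more heavily for margin violations, which is what shifts the separating hyperplane towards the test distribution.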
Detailed description of the invention
Fig. 1 is the training flow chart of the importance-weighted SVM of the invention.
Fig. 2 is the importance-weight flow chart of the invention.
Specific embodiment
The present invention is further illustrated below with reference to an embodiment and the accompanying drawings.
The speech emotion recognition method of the invention based on the importance-weighted support vector machine classifier comprises the following steps:
Step 1: pre-process the input sample set to obtain the pre-processed input training sample set {s_i^tr}_{i=1}^{n_tr} and test sample set {s_j^te}_{j=1}^{n_te}, together with b template points {c_l}_{l=1}^{b} randomly selected from the pre-processed test sample set, where s_i^tr is a sample of the pre-processed training sample set, s_j^te is a sample of the pre-processed test sample set, n_tr is the number of training samples, n_te is the number of test samples, c_l is a template point randomly selected from the test sample set, i is the sample index in the training set, j is the sample index in the test set, and l is the index within the randomly selected template set.
The pre-processing specifically comprises the following steps:
Step 1.1: apply pre-emphasis to the digital speech signal X according to the pre-emphasis formula to obtain the pre-emphasised speech signal X̃,
where ñ denotes the discrete-point index of the digital speech signal X, Ñ is the length of the digital speech signal X, X(ñ) and X(ñ−1) denote the values of X at the ñ-th and (ñ−1)-th discrete points respectively, X̃(ñ) denotes the value of the pre-emphasised speech signal X̃ at the ñ-th discrete point, and X(−1) = 0;
Step 1.2: frame the pre-emphasised speech signal X̃ by overlapping segmentation. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate F_s = 16 kHz, and each frame is 16 ms long, i.e. 256 points. Framing X̃ yields the speech-frame set {x̃_k'}_{1≤k'≤K'},
where x̃_k' is the k'-th speech frame of the speech-frame set, n denotes the discrete-point index within a speech frame, k' is the speech-frame index, and K' is the total number of speech frames, determined by the signal length, the 256-point frame length and the 128-point frame shift, with the fraction rounded down;
Step 1.3: apply windowing to each speech frame x̃_k' with a 256-point Hamming window w, obtaining the windowed speech frame x_k':
$$x_{k'}(n) = \tilde{x}_{k'}(n)\, w(n)$$
where x_k'(n), x̃_k'(n) and w(n) respectively denote the values of x_k', x̃_k' and w at the n-th discrete point, and the 256-point Hamming window function is
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{255}\right), \qquad 0 \leq n \leq 255.$$
Endpoint detection is then performed using the well-known energy/zero-crossing-rate double-threshold method, with the following specific steps:
Step 1.4: for each windowed speech frame x_k', 1 ≤ k' ≤ K', compute the short-time energy E_k' and the short-time zero-crossing rate Z_k':
$$E_{k'} = \sum_{n=0}^{255} x_{k'}(n)^2, \qquad Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \big|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\big|$$
where E_k' denotes the short-time energy of the windowed speech frame x_k', Z_k' denotes the short-time zero-crossing rate of x_k', x_k'(n) is the value of x_k' at the n-th sampling point, x_k'(n−1) is the value of x_k' at the (n−1)-th sampling point, and sgn[x_k'(n)] and sgn[x_k'(n−1)] are the sign functions of x_k'(n) and x_k'(n−1) respectively, that is:
$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \geq 0 \\ -1, & \lambda < 0 \end{cases}$$
Step 1.5: determine the short-time energy threshold t_E and the short-time zero-crossing-rate threshold t_Z from the statistics of the speech frames, where K' is the total number of speech frames;
Step 1.6: for each windowed speech frame, first perform a first-stage discrimination using the short-time energy: windowed speech frames whose short-time energy exceeds the threshold t_E are marked as first-stage valid speech frames; the first-stage valid speech frame with the smallest frame index is taken as the start frame of the current valid speech-frame set, and the one with the largest frame index is taken as the end frame of the current valid speech-frame set. Then perform a second-stage discrimination using the short-time zero-crossing rate: for the current valid speech-frame set, examine the frames one by one starting from the start frame in decreasing order of frame index, marking windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as valid speech frames, and likewise examine the frames one by one starting from the end frame in increasing order of frame index, marking windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as valid speech frames. The valid speech-frame set obtained after the two-stage discrimination is denoted {s_k}_{1≤k≤K}, where k is the valid speech-frame index, K is the total number of valid speech frames, and s_k is the k-th valid speech frame of the valid speech-frame set.
Step 2: compute the optimal Gaussian kernel width σ̂ of the basis functions.
The closeness between the distribution of the training sample data and that of the test sample data can be expressed by the importance weight β(s):
$$\beta(s) = \frac{p_{te}(s)}{p_{tr}(s)}$$
where p_tr(s) denotes the distribution density of the pre-processed training sample set {s_i^tr}_{i=1}^{n_tr} and p_te(s) denotes the distribution density of the pre-processed test sample set {s_j^te}_{j=1}^{n_te}.
Step 2.1: set the candidate basis-function Gaussian kernel widths to σ = 0.1, 0.2, ..., 1;
Step 2.2: compute the pre-estimated parameter vector α:
β(s) is modelled by the linear model
$$\hat{\beta}(s) = \sum_{l=1}^{b} \alpha_l \varphi_l(s)$$
where α = (α_1, α_2, ..., α_b)', φ_l(s) is a basis function, and b and φ_l(s) can be determined from the samples {s_i^tr}_{i=1}^{n_tr} and {s_j^te}_{j=1}^{n_te}.
Compute the squared-error criterion J_0(α):
$$J_0(\alpha) = \frac{1}{2}\int \big(\hat{\beta}(s) - \beta(s)\big)^2 p_{tr}(s)\,ds$$
The last term of the expansion of this expression is a constant and can be ignored; the first two terms are denoted J(α):
$$J(\alpha) = \frac{1}{2}\alpha' H \alpha - h'\alpha$$
where α' is the transpose of the vector α, H is a b × b matrix with elements H_{l,l'} = ∫ φ_l(s) φ_{l'}(s) p_tr(s) ds, and h is a b-dimensional vector with elements h_l = ∫ φ_l(s) p_te(s) ds.
Approximating the expectations in J(α) by sample averages yields the approximated expected squared error of the importance weights Ĵ(α):
$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha$$
where Ĥ is a b × b matrix with elements Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(s_i^tr) φ_{l'}(s_i^tr), ĥ is a b-dimensional vector with elements ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(s_j^te), and ĥ' is the transpose of ĥ.
Taking into account the non-negativity of the importance weight β(s), this becomes the optimization problem
$$\min_{\alpha} \hat{J}(\alpha) \qquad \text{subject to } \alpha \geq 0$$
Solving this optimization problem gives the parameter vector α as its optimal solution.
When computing Ĥ_{l,l'} and ĥ_l, the basis function φ_l(s) = exp(−‖s − c_l‖² / (2σ²)) is a Gaussian kernel function of width σ. Substituting φ_l(s) into Ĥ_{l,l'} and ĥ_l gives
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}}\sum_{i=1}^{n_{tr}} \exp\!\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right), \qquad \hat{h}_{l} = \frac{1}{n_{te}}\sum_{j=1}^{n_{te}} \exp\!\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right)$$
where Ĥ_{l,l'} is an element of Ĥ, ĥ_l is an element of ĥ, l, l' = 1, 2, ..., b, c_{l'} is a template point randomly selected from {s_j^te}_{j=1}^{n_te}, l' is the index within the randomly selected template set, and σ is one of the preset values.
Step 2.3: compute the optimal basis-function Gaussian kernel width σ̂ by cross-validation.
Divide the pre-processed training sample set {s_i^tr}_{i=1}^{n_tr} and test sample set {s_j^te}_{j=1}^{n_te} into R subsets {S_tr^r}_{r=1}^{R} and {S_te^r}_{r=1}^{R} respectively, and compute
$$\hat{J}^{(r)} = \frac{1}{2\,|S_{tr}^{r}|}\sum_{s^{tr}\in S_{tr}^{r}} \hat{\beta}(s^{tr})^2 - \frac{1}{|S_{te}^{r}|}\sum_{s^{te}\in S_{te}^{r}} \hat{\beta}(s^{te})$$
where Ĵ^(r) is the approximated expected squared error of the importance weights on the r-th subset under cross-validation, r = 1, 2, ..., R, S_tr^r is the r-th training-sample subset, S_te^r is the r-th test-sample subset, |S_tr^r| and |S_te^r| are their numbers of samples, s^tr is a sample of S_tr^r, s^te is a sample of S_te^r, and β̂(s^tr) and β̂(s^te) are the importance-weight estimates of the samples s^tr and s^te.
Compute the cross-validated approximated expected squared error of the importance weights
$$\hat{J} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}^{(r)}, \qquad r = 1, 2, \ldots, R.$$
The optimal basis-function Gaussian kernel width σ̂ is obtained by minimising Ĵ over the candidate widths, i.e.
$$\hat{\sigma} = \arg\min_{\sigma} \hat{J}(\sigma)$$
where σ takes each of the preset values 0.1, 0.2, ..., 1.
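For illustration, the kernel-width selection of Steps 2.1–2.3 can be sketched as follows, reusing fit_importance_weights() from the earlier sketch; the number of subsets R and the splitting scheme are assumptions:

```python
import numpy as np

def select_kernel_width(X_tr, X_te, centers, sigmas=np.arange(0.1, 1.01, 0.1), R=5):
    """Step 2.3 sketch: for each candidate width, fit the weight model on the full sets
    (Step 2.2) and average J^(r) = 0.5*mean(beta(S_tr^r)^2) - mean(beta(S_te^r)) over
    the R subsets; keep the width with the smallest average."""
    tr_splits = np.array_split(np.arange(len(X_tr)), R)
    te_splits = np.array_split(np.arange(len(X_te)), R)
    scores = []
    for sigma in sigmas:
        beta = fit_importance_weights(X_tr, X_te, centers, sigma)
        J_r = [0.5 * np.mean(beta(X_tr[tr]) ** 2) - np.mean(beta(X_te[te]))
               for tr, te in zip(tr_splits, te_splits)]
        scores.append(np.mean(J_r))
    return sigmas[int(np.argmin(scores))]
```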
Step 3: compute the optimal parameter vector α̂.
Substitute the Gaussian basis functions obtained in Step 2 and the optimal basis-function Gaussian kernel width σ̂ into Ĥ_{l,l'} and ĥ_l:
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}}\sum_{i=1}^{n_{tr}} \exp\!\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_{l} = \frac{1}{n_{te}}\sum_{j=1}^{n_{te}} \exp\!\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right)$$
where l, l' = 1, 2, ..., b;
Using these, solve the optimization problem min_α Ĵ(α) = (1/2)α'Ĥα − ĥ'α under the constraint α ≥ 0 to obtain the optimal parameter vector α̂.
Step 4: compute the approximate importance weights β̂(s).
From Step 2, β(s) is modelled by the linear model β̂(s) = Σ_{l=1}^{b} α̂_l φ_l(s); substituting the Gaussian basis functions gives
$$\hat{\beta}(s) = \sum_{l=1}^{b} \hat{\alpha}_l \exp\!\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right)$$
where α̂_l is the l-th element of the vector α̂, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points.
Step 5: establish the importance-weighted SVM classifier model.
Using the importance weights as coefficients on the slack variables ξ of the standard SVM classifier:
$$\min_{w,\,b,\,\xi} \;\; \frac{1}{2}|w|^2 + C\sum_{i=1}^{L} \beta_i \xi_i \qquad (13)$$
with the constraints y_i(⟨w, d_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ L, where w is the normal vector of the separating hyperplane, |w| is the norm of w, ξ is the slack variable, C is the penalty parameter, d_i is the feature vector extracted from the training sample s_i^tr, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), ..., (d_L, y_L), and β_i is the importance weight of the training sample point (d_i, y_i).
The statistical features of an utterance sample take the frame-level short-time features (such as fundamental frequency, frame energy, mel-frequency cepstral coefficients and the wavelet-packet cepstral coefficient feature described herein) as low-level descriptors (Low Level Descriptors, LLDs); the statement-level feature parameters are obtained by computing statistics of all the short-time features over the utterance.
Common statistics used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0–250 Hz, 0–650 Hz, 250–650 Hz, 1–4 kHz), the cepstral energies of 26 mel-frequency bands, 13th-order mel-frequency cepstral coefficients, the positions of the mel-spectrum maximum and minimum, and the 90%, 75%, 50% and 25% mel-spectrum roll-off points. Formula (13) together with its constraints constitutes the importance-weighted SVM classifier model.
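Putting the sketches together, the overall flow of the embodiment could look as follows, assuming the utterance-level feature matrices X_tr, X_te and the labels y_tr have already been produced by the pre-processing and feature-extraction steps (the number of template points b is an assumed value):

```python
import numpy as np

# assumed inputs: X_tr (n_tr x d), y_tr (n_tr,), X_te (n_te x d)
b = 100                                                        # number of template points (b <= n_te)
centers = X_te[np.random.choice(len(X_te), b, replace=False)]
sigma_hat = select_kernel_width(X_tr, X_te, centers)           # Steps 2.1-2.3
beta = fit_importance_weights(X_tr, X_te, centers, sigma_hat)  # Steps 3-4
clf = train_importance_weighted_svm(X_tr, y_tr, beta(X_tr))    # Step 5
y_pred = clf.predict(X_te)                                     # predicted emotion labels
```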
The above embodiment is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and equivalent replacements can be made without departing from the principle of the present invention, and the technical solutions obtained by such improvements and equivalent replacements of the claims of the present invention all fall within the protection scope of the present invention.
Claims (3)
1. A speech emotion recognition method based on an importance-weighted support vector machine classifier, characterized in that the method comprises the following steps:
Step 1: pre-process the input speech signal and extract the feature vectors d_i;
Step 2: divide the input sample set into a training sample set {s_i^tr}_{i=1}^{n_tr} and a test sample set {s_j^te}_{j=1}^{n_te}, and randomly select b template points c_l from the test sample set to form {c_l}_{l=1}^{b}, where s_i^tr is a sample of the training sample set, s_j^te is a sample of the test sample set, n_tr is the number of samples in the training set, n_te is the number of samples in the test set, i is the sample index in the training set, j is the sample index in the test set, and l is the index within the randomly selected template set;
Step 3: compute the optimal Gaussian kernel width σ̂ of the basis functions; the detailed procedure is as follows:
Step 3.1: set the candidate basis-function Gaussian kernel widths to σ = 0.1, 0.2, ..., 1;
Step 3.2: compute the pre-estimated parameter vector α according to the following procedure:
Step 3.2.1: compute Ĥ_{l,l'} according to the following formula and build the b × b matrix Ĥ with Ĥ_{l,l'} as its elements:
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right)$$
where Ĥ is a b × b matrix, Ĥ_{l,l'} is an element of Ĥ, l, l' = 1, 2, ..., b, c_{l'} is a point of the randomly selected test-sample template set {c_l}_{l=1}^{b}, and l' is an index within the randomly selected template set;
Step 3.2.2: compute ĥ_l according to the following formula and build the b-dimensional vector ĥ with ĥ_l as its elements:
$$\hat{h}_{l} = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right)$$
where ĥ is a b-dimensional vector and ĥ_l is an element of ĥ;
Step 3.2.3: compute the pre-estimated parameter vector α:
with α ≥ 0 as the constraint, solve the optimization problem min_α Ĵ(α), i.e. find the value of the parameter vector α that minimizes
$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha$$
where Ĵ(α) is the approximated expected squared error of the importance weights, α' is the transpose of the vector α, and ĥ' is the transpose of the vector ĥ;
Step 3.3: compute the optimal Gaussian kernel width σ̂ of the basis functions by cross-validation:
Divide the training sample set {s_i^tr}_{i=1}^{n_tr} and the test sample set {s_j^te}_{j=1}^{n_te} into R subsets {S_tr^r}_{r=1}^{R} and {S_te^r}_{r=1}^{R} respectively, and compute the approximated expected squared error of the importance weights on the r-th subset under cross-validation:
$$\hat{J}^{(r)} = \frac{1}{2\,|S_{tr}^{r}|}\sum_{s^{tr}\in S_{tr}^{r}} \hat{\beta}(s^{tr})^2 \;-\; \frac{1}{|S_{te}^{r}|}\sum_{s^{te}\in S_{te}^{r}} \hat{\beta}(s^{te})$$
where Ĵ^(r) is the approximated expected squared error of the importance weights on the r-th subset under cross-validation, r = 1, 2, ..., R, S_tr^r is the r-th training-sample subset, S_te^r is the r-th test-sample subset, |S_tr^r| and |S_te^r| are their numbers of samples, s^tr is a sample of S_tr^r, s^te is a sample of S_te^r, and β̂(s^tr) and β̂(s^te) are the importance-weight estimates of the samples s^tr and s^te, computed as
$$\hat{\beta}(s) = \sum_{l=1}^{b} \alpha_l \exp\left(-\frac{\|s-c_l\|^2}{2\sigma^2}\right)$$
where α_l is the l-th element of the pre-estimated parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset values σ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to obtain the cross-validated approximated expected squared error of the importance weights
$$\hat{J} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}^{(r)}, \qquad r = 1, 2, \ldots, R,$$
and take the σ value giving the smallest Ĵ as the optimal basis-function Gaussian kernel width σ̂;
Step 4: with α ≥ 0 as the constraint, solve the optimization problem min_α Ĵ(α) = (1/2)α'Ĥα − ĥ'α to obtain the optimal parameter vector α̂, where
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_{l} = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right), \qquad l, l' = 1, 2, \ldots, b,$$
Ĥ_{l,l'} is the element of the matrix Ĥ in row l and column l', and ĥ_l is the l-th element of the one-dimensional vector ĥ;
Step 5: compute the importance weight β(s) by the following formula:
$$\beta(s) = \sum_{l=1}^{b} \hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right)$$
where α̂_l is the l-th element of the optimal parameter vector α̂, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points;
Step 6: establish the importance-weighted SVM classifier:
Using the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier, the following SVM classifier expression is obtained:
$$\min_{w,\,b,\,\xi} \;\; \frac{1}{2}|w|^2 + C\sum_{i=1}^{L} \beta_i \xi_i$$
This SVM classifier expression together with the following constraints constitutes the importance-weighted SVM classifier:
$$y_i(\langle w, d_i\rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad 1 \leq i \leq L$$
where w is the normal vector of the separating hyperplane, |w| is the norm of w, C is the penalty parameter, d_i is a feature vector extracted from the pre-processed training sample set {s_i^tr}_{i=1}^{n_tr}, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), ..., (d_L, y_L), β_i is the importance weight of the training sample point (d_i, y_i), and ξ_i is the slack variable of the training sample point (d_i, y_i);
Step 7: use the feature vectors extracted in Step 1 and the importance-weighted SVM classifier established in Step 6 to recognize the speech emotion.
2. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1, characterized in that the pre-processing in Step 1 comprises the following steps:
Step 1.1: apply pre-emphasis to the digital speech signal X according to the pre-emphasis formula to obtain the pre-emphasised speech signal X̃,
where ñ denotes the discrete-point index of the digital speech signal X, Ñ is the length of the digital speech signal X, X(ñ) and X(ñ−1) denote the values of X at the ñ-th and (ñ−1)-th discrete points respectively, X̃(ñ) denotes the value of the pre-emphasised speech signal X̃ at the ñ-th discrete point, and X(−1) = 0;
Step 1.2: frame the pre-emphasised speech signal X̃ by overlapping segmentation. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate F_s = 16 kHz, and each frame is 16 ms long, i.e. 256 points. Framing X̃ yields the speech-frame set {x̃_k'}_{1≤k'≤K'}, in which the n-th discrete point of the k'-th speech frame is taken from X̃ according to the 128-point frame shift,
where x̃_k' is the k'-th speech frame of the speech-frame set, n denotes the discrete-point index within a speech frame, k' is the speech-frame index, and K' is the total number of speech frames, determined by the signal length, the 256-point frame length and the 128-point frame shift, with the fraction rounded down;
Step 1.3: apply windowing to each speech frame x̃_k' with a 256-point Hamming window w, obtaining the windowed speech frame x_k':
$$x_{k'}(n) = \tilde{x}_{k'}(n)\, w(n)$$
where x_k'(n), x̃_k'(n) and w(n) respectively denote the values of x_k', x̃_k' and w at the n-th discrete point, and the 256-point Hamming window function is
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{255}\right), \qquad 0 \leq n \leq 255;$$
Step 1.4: for each windowed speech frame x_k', 1 ≤ k' ≤ K', compute the short-time energy E_k' and the short-time zero-crossing rate Z_k':
$$E_{k'} = \sum_{n=0}^{255} x_{k'}(n)^2, \qquad Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \big|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\big|$$
where E_k' denotes the short-time energy of the windowed speech frame x_k', Z_k' denotes the short-time zero-crossing rate of x_k', x_k'(n) is the value of x_k' at the n-th sampling point, x_k'(n−1) is the value of x_k' at the (n−1)-th sampling point, and sgn[x_k'(n)] and sgn[x_k'(n−1)] are the sign functions of x_k'(n) and x_k'(n−1) respectively, that is:
$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \geq 0 \\ -1, & \lambda < 0 \end{cases}$$
where λ is the argument of the above sign function;
Step 1.5: determine the short-time energy threshold t_E and the short-time zero-crossing-rate threshold t_Z from the statistics of the speech frames, where K' is the total number of speech frames;
Step 1.6: for each windowed speech frame, first perform a first-stage discrimination using the short-time energy: windowed speech frames whose short-time energy exceeds the threshold t_E are marked as first-stage valid speech frames; the first-stage valid speech frame with the smallest frame index is taken as the start frame of the current valid speech-frame set, and the one with the largest frame index is taken as the end frame of the current valid speech-frame set;
Then perform a second-stage discrimination using the short-time zero-crossing rate: for the current valid speech-frame set, examine the frames one by one starting from the start frame in decreasing order of frame index, marking windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as valid speech frames, and likewise examine the frames one by one starting from the end frame in increasing order of frame index, marking windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as valid speech frames;
The valid speech-frame set obtained after the two-stage discrimination is denoted {p_k}_{1≤k≤K}, where k is the valid speech-frame index, K is the total number of valid speech frames, and p_k is the k-th valid speech frame of the valid speech-frame set.
3. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1 or 2, characterized in that the feature vector d_i in Step 1 is extracted as follows:
the short-time features at the speech-frame level, together with the first-order and second-order differences of the short-time features, are used as low-level descriptors, and the statement-level features are obtained by computing statistics of the low-level descriptors over the utterance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610969948.7A CN106504772B (en) | 2016-11-04 | 2016-11-04 | Speech-emotion recognition method based on weights of importance support vector machine classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610969948.7A CN106504772B (en) | 2016-11-04 | 2016-11-04 | Speech-emotion recognition method based on weights of importance support vector machine classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106504772A CN106504772A (en) | 2017-03-15 |
CN106504772B true CN106504772B (en) | 2019-08-20 |
Family
ID=58322831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610969948.7A Active CN106504772B (en) | 2016-11-04 | 2016-11-04 | Speech-emotion recognition method based on weights of importance support vector machine classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106504772B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108735233A (en) * | 2017-04-24 | 2018-11-02 | 北京理工大学 | A kind of personality recognition methods and device |
CN108364641A (en) * | 2018-01-09 | 2018-08-03 | 东南大学 | A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise |
CN108831450A (en) * | 2018-03-30 | 2018-11-16 | 杭州鸟瞰智能科技股份有限公司 | A kind of virtual robot man-machine interaction method based on user emotion identification |
WO2020024210A1 (en) * | 2018-08-02 | 2020-02-06 | 深圳大学 | Method and apparatus for optimizing window parameter of integrated kernel density estimator, and terminal device |
CN110991238B (en) * | 2019-10-30 | 2023-04-28 | 中科南京人工智能创新研究院 | Speech assisting system based on speech emotion analysis and micro expression recognition |
CN111415680B (en) * | 2020-03-26 | 2023-05-23 | 心图熵动科技(苏州)有限责任公司 | Voice-based anxiety prediction model generation method and anxiety prediction system |
CN113434698B (en) * | 2021-06-30 | 2022-08-02 | 华中科技大学 | Relation extraction model establishing method based on full-hierarchy attention and application thereof |
CN116363664A (en) * | 2023-04-10 | 2023-06-30 | 国网江苏省电力有限公司信息通信分公司 | OCR technology-based ciphertext-related book checking and labeling method and system |
CN116801456A (en) * | 2023-08-22 | 2023-09-22 | 深圳市创洺盛光电科技有限公司 | Intelligent control method of LED lamp |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080077720A (en) * | 2007-02-21 | 2008-08-26 | 인하대학교 산학협력단 | A voice activity detecting method based on a support vector machine(svm) using a posteriori snr, a priori snr and a predicted snr as a feature vector |
KR20110021328A (en) * | 2009-08-26 | 2011-03-04 | 인하대학교 산학협력단 | The method to improve the performance of speech/music classification for 3gpp2 codec by employing svm based on discriminative weight training |
CN102201237A (en) * | 2011-05-12 | 2011-09-28 | 浙江大学 | Emotional speaker identification method based on reliability detection of fuzzy support vector machine |
CN103544963A (en) * | 2013-11-07 | 2014-01-29 | 东南大学 | Voice emotion recognition method based on core semi-supervised discrimination and analysis |
CN104091602A (en) * | 2014-07-11 | 2014-10-08 | 电子科技大学 | Speech emotion recognition method based on fuzzy support vector machine |
-
2016
- 2016-11-04 CN CN201610969948.7A patent/CN106504772B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080077720A (en) * | 2007-02-21 | 2008-08-26 | 인하대학교 산학협력단 | A voice activity detecting method based on a support vector machine(svm) using a posteriori snr, a priori snr and a predicted snr as a feature vector |
KR20110021328A (en) * | 2009-08-26 | 2011-03-04 | 인하대학교 산학협력단 | The method to improve the performance of speech/music classification for 3gpp2 codec by employing svm based on discriminative weight training |
CN102201237A (en) * | 2011-05-12 | 2011-09-28 | 浙江大学 | Emotional speaker identification method based on reliability detection of fuzzy support vector machine |
CN103544963A (en) * | 2013-11-07 | 2014-01-29 | 东南大学 | Voice emotion recognition method based on core semi-supervised discrimination and analysis |
CN104091602A (en) * | 2014-07-11 | 2014-10-08 | 电子科技大学 | Speech emotion recognition method based on fuzzy support vector machine |
Non-Patent Citations (2)
Title |
---|
Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition;Yongming Huang et al.;《IET Signal Process》;20150615;第9卷(第4期);341-348 |
Speech signal emotion recognition based on SVM; Qin Yuqiang, Zhang Xueying; Journal of Circuits and Systems; 20121031; Vol. 17, No. 5; 55-59
Also Published As
Publication number | Publication date |
---|---|
CN106504772A (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106504772B (en) | Speech-emotion recognition method based on weights of importance support vector machine classifier | |
CN106328121B (en) | Chinese Traditional Instruments sorting technique based on depth confidence network | |
CN109256150B (en) | Speech emotion recognition system and method based on machine learning | |
CN103345923B (en) | A kind of phrase sound method for distinguishing speek person based on rarefaction representation | |
CN109243494B (en) | Children emotion recognition method based on multi-attention mechanism long-time memory network | |
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term | |
CN105469784B (en) | A kind of speaker clustering method and system based on probability linear discriminant analysis model | |
CN109493886A (en) | Speech-emotion recognition method based on feature selecting and optimization | |
CN108899049A (en) | A kind of speech-emotion recognition method and system based on convolutional neural networks | |
CN109815892A (en) | The signal recognition method of distributed fiber grating sensing network based on CNN | |
WO2016119604A1 (en) | Voice information search method and apparatus, and server | |
CN113066499B (en) | Method and device for identifying identity of land-air conversation speaker | |
CN102723078A (en) | Emotion speech recognition method based on natural language comprehension | |
CN102655003B (en) | Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient) | |
CN110767210A (en) | Method and device for generating personalized voice | |
CN103985381A (en) | Voice frequency indexing method based on parameter fusion optimized decision | |
CN111128128B (en) | Voice keyword detection method based on complementary model scoring fusion | |
Zhang et al. | Speech emotion recognition using combination of features | |
CN108364641A (en) | A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise | |
CN107358947A (en) | Speaker recognition methods and system again | |
CN112417132B (en) | New meaning identification method for screening negative samples by using guest information | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN111128240B (en) | Voice emotion recognition method based on anti-semantic-erasure | |
CN115312080A (en) | Voice emotion recognition model and method based on complementary acoustic characterization | |
CN105070300A (en) | Voice emotion characteristic selection method based on speaker standardization change |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |