CN106504772A - Speech-emotion recognition method based on weights of importance support vector machine classifier - Google Patents
- Publication number
- CN106504772A (application CN201610969948.7A)
- Authority
- CN
- China
- Prior art keywords
- frame
- sample
- weights
- prime
- importance
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Abstract
The invention discloses a speech emotion recognition method based on an importance-weighted support vector machine (SVM) classifier, comprising quantification of the deviation between training samples and test samples, construction of an importance weight coefficient model, and construction of an SVM based on the importance weight coefficients. The method quantifies the deviation between training and test samples through importance weight coefficients, so that the deviation can be corrected at the classifier level. By building an importance weight coefficient model for the deviation between training and test samples in emotion classification, quantifying that deviation in the speech samples, and applying an SVM classifier built on the importance weight model, the present invention corrects the deviation at the classifier level by adjusting the separating hyperplane, improving the accuracy and stability of speech emotion recognition.
Description
Technical field
The present invention relates to a speech emotion recognition method based on an importance-weighted support vector machine classifier, and belongs to the technical field of speech emotion recognition.
Background technology
With the rapid development of information technology and the rise of various intelligent terminals, existing human-computer interaction systems face increasingly severe tests. To overcome the obstacles of human-computer interaction and make it more convenient and natural, the emotional intelligence of machines is receiving growing attention from researchers in many fields. Speech, as a highly efficient interactive medium in modern human-computer interaction, carries rich emotional information. As an important research topic in affective computing, speech emotion recognition has broad application prospects in distance education, assisted lie detection, automated telephone service centers, clinical medicine, intelligent toys and smartphones, and has attracted wide attention from a growing number of research institutions and researchers.
In speech emotion recognition, the acquisition time and recording environment of the training samples often differ from those of the test samples, so a covariate shift exists between training and test data. To improve the precision and robustness of speech emotion recognition, it is essential to compensate for this deviation. Excluding the deviation introduced by the recording environment, removing redundant information such as emotion-irrelevant speaking content from the raw speech data, and extracting effective emotion information are the key points and difficulties in improving the robustness of speech emotion recognition systems.
As an emerging technique in speech signal processing, the importance weight coefficient model is attracting increasing attention from researchers because of its flexibility and effectiveness. For classification problems, the deviation between training and test samples can be quantified by importance weight coefficients, so that the deviation can be corrected at the classifier level, reducing the influence of environmental factors on speech emotion recognition and improving its accuracy and stability. This method of compensating, at the classifier level, the covariate shift between training and test samples is of great significance for speech emotion recognition research.
Content of the invention
Technical problem: The present invention provides a speech emotion recognition method based on an importance-weighted support vector machine classifier that improves the robustness of speech emotion recognition by compensating, at the classifier level, for the covariate shift between training samples and test samples. The method reduces the influence on recognition of irrelevant information such as the recording environment and the identity of the speaker, and improves the precision and robustness of speech emotion recognition.
Technical scheme: The speech emotion recognition method based on an importance-weighted support vector machine classifier of the present invention comprises the following steps:
Step 1: Preprocess the input speech signal and extract the feature vectors d_i;
Step 2: Divide the input sample set into a training sample set {s_i^tr}, 1 ≤ i ≤ n_tr, and a test sample set {s_j^te}, 1 ≤ j ≤ n_te, and randomly select b template points c_l, 1 ≤ l ≤ b, from the test sample set, where s_i^tr is a sample in the training set, s_j^te is a sample in the test set, n_tr is the number of training samples, n_te is the number of test samples, i indexes the training samples, j indexes the test samples, and l indexes the randomly selected template points;
Step 3: Compute the optimal Gaussian kernel width σ̂ of the basis functions. The specific flow is as follows:
Step 3.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, …, 1;
Step 3.2: Compute the pre-compensation parameter vector α according to the following flow:
Step 3.2.1: Build the b × b matrix Ĥ with elements
Ĥ_{l,l′} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(s_i^tr) φ_{l′}(s_i^tr), l, l′ = 1, 2, …, b,
where φ_l(s) = exp(−‖s − c_l‖² / (2σ²)) is the Gaussian basis function centred at the template point c_l, c_{l′} is a point of the randomly selected test sample subset {c_l}, and l′ also indexes the selected template points;
Step 3.2.2: Build the b-dimensional vector ĥ with elements
ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(s_j^te);
Step 3.2.3: Compute the pre-compensation parameter vector α:
With α ≥ 0 as the constraint, solve the optimization problem min_α Ĵ(α), i.e. find the value of the parameter vector α that minimises
Ĵ(α) = (1/2) α′ Ĥ α − ĥ′ α,
where Ĵ(α) is the approximated expected squared error of the importance weights, α′ is the transpose of the vector α, and ĥ′ is the transpose of the vector ĥ;
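The construction of Ĥ and ĥ and the non-negative minimisation in steps 3.2.1-3.2.3 can be sketched as follows. This is a minimal NumPy sketch under the assumption that samples are numeric feature vectors; the ridge-regularised linear solve followed by projection onto α ≥ 0 is a stand-in for the constrained optimisation described above, not the patent's exact solver:

```python
import numpy as np

def gaussian_basis(S, C, sigma):
    """phi_l(s) = exp(-||s - c_l||^2 / (2 sigma^2)) for every sample/template pair."""
    d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))                  # shape (n_samples, b)

def fit_alpha(S_tr, S_te, C, sigma, ridge=1e-6):
    """Estimate the pre-compensation vector alpha (alpha >= 0) by minimising
    J(alpha) = 0.5 * alpha' H alpha - h' alpha, as in steps 3.2.1-3.2.3."""
    Phi_tr = gaussian_basis(S_tr, C, sigma)          # basis values on training samples
    Phi_te = gaussian_basis(S_te, C, sigma)          # basis values on test samples
    H = Phi_tr.T @ Phi_tr / len(S_tr)                # b x b matrix H-hat
    h = Phi_te.mean(axis=0)                          # b-dimensional vector h-hat
    alpha = np.linalg.solve(H + ridge * np.eye(len(C)), h)  # unconstrained minimiser
    return np.maximum(alpha, 0.0)                    # project onto alpha >= 0

# toy usage: training and test sets drawn from shifted Gaussians
rng = np.random.default_rng(0)
S_tr = rng.normal(0.0, 1.0, size=(200, 2))
S_te = rng.normal(0.5, 1.0, size=(100, 2))
C = S_te[rng.choice(len(S_te), size=20, replace=False)]  # b = 20 template points
alpha = fit_alpha(S_tr, S_te, C, sigma=0.5)
beta_tr = gaussian_basis(S_tr, C, 0.5) @ alpha           # importance weights beta(s)
```

A quadratic-programming solver would enforce α ≥ 0 exactly; the solve-then-project shortcut is the common least-squares approximation.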
Step 3.3: Select the optimal basis-function Gaussian kernel width σ̂ by cross-validation:
Divide the training sample set {s_i^tr} and the test sample set {s_j^te} into R subsets {S_r^tr} and {S_r^te} respectively, and compute the approximated expected squared error of the importance weights on the r-th fold of the cross-validation:
Ĵ^(r) = (1 / (2|S_r^tr|)) Σ_{s^tr ∈ S_r^tr} β̂(s^tr)² − (1 / |S_r^te|) Σ_{s^te ∈ S_r^te} β̂(s^te),
where Ĵ^(r) is the approximated expected squared error of the importance weights on the r-th fold, r = 1, 2, …, R, S_r^tr is the r-th training subset, S_r^te is the r-th test subset, |S_r^tr| and |S_r^te| are their sample counts, s^tr is a sample of S_r^tr, s^te is a sample of S_r^te, and β̂(s^tr) and β̂(s^te) are the importance weight estimates of s^tr and s^te, computed as
β̂(s) = Σ_{l=1}^{b} α_l exp(−‖s − c_l‖² / (2σ²)),
where α_l is the l-th element of the pre-compensation parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset σ values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 in turn into
Ĵ_CV = (1/R) Σ_{r=1}^{R} Ĵ^(r)
to obtain the cross-validated approximated expected squared error Ĵ_CV of the importance weights, and take the σ corresponding to the minimum Ĵ_CV as the optimal basis-function Gaussian kernel width σ̂;
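Step 3.3's grid search over the preset σ values can be sketched as follows (same setting as above: Gaussian basis functions centred at b template points, per-fold score Ĵ; the fold-splitting details and the ridge-regularised α fit are illustrative assumptions):

```python
import numpy as np

def gaussian_basis(S, C, sigma):
    d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cv_score(S_tr, S_te, C, sigma, R=5, ridge=1e-6):
    """Average held-out J-hat over R folds for one candidate kernel width."""
    tr_folds = np.array_split(S_tr, R)
    te_folds = np.array_split(S_te, R)
    scores = []
    for r in range(R):
        tr_fit = np.vstack([f for q, f in enumerate(tr_folds) if q != r])
        te_fit = np.vstack([f for q, f in enumerate(te_folds) if q != r])
        # fit alpha on all folds except r
        Phi_tr = gaussian_basis(tr_fit, C, sigma)
        H = Phi_tr.T @ Phi_tr / len(tr_fit)
        h = gaussian_basis(te_fit, C, sigma).mean(axis=0)
        alpha = np.maximum(np.linalg.solve(H + ridge * np.eye(len(C)), h), 0.0)
        # evaluate J-hat on the held-out fold r
        b_tr = gaussian_basis(tr_folds[r], C, sigma) @ alpha
        b_te = gaussian_basis(te_folds[r], C, sigma) @ alpha
        scores.append(0.5 * np.mean(b_tr ** 2) - np.mean(b_te))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
S_tr = rng.normal(0.0, 1.0, size=(200, 2))
S_te = rng.normal(0.5, 1.0, size=(100, 2))
C = S_te[:20]
widths = [0.1 * k for k in range(1, 11)]          # the preset grid 0.1 ... 1.0
best_sigma = min(widths, key=lambda s: cv_score(S_tr, S_te, C, s))
```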
Step 4: With α ≥ 0 as the constraint, solve the optimization problem min_α [(1/2) α′ Ĥ α − ĥ′ α], where Ĥ and ĥ are computed as in steps 3.2.1 and 3.2.2 with the kernel width set to σ̂, i.e.
Ĥ_{l,l′} = (1/n_tr) Σ_{i=1}^{n_tr} exp(−(‖s_i^tr − c_l‖² + ‖s_i^tr − c_{l′}‖²)/(2σ̂²)), the element in row l and column l′ of the matrix Ĥ, and
ĥ_l = (1/n_te) Σ_{j=1}^{n_te} exp(−‖s_j^te − c_l‖²/(2σ̂²)), the l-th element of the vector ĥ, l, l′ = 1, 2, …, b;
the solution is the optimal parameter vector α̂;
Step 5: Compute the importance weight β(s) by
β(s) = Σ_{l=1}^{b} α̂_l exp(−‖s − c_l‖²/(2σ̂²)),
where α̂_l is an element of the optimal parameter vector α̂, and s is a sample of the set D of all training and test sample points, s ∈ D;
Step 6: Build the importance-weighted SVM classifier:
Add the importance weight β(s) as a coefficient of the slack variables ξ in the standard SVM objective, giving the following SVM classifier expression:
min_{w,ξ} (1/2) ‖w‖² + C Σ_{i=1}^{L} β_i ξ_i
This expression, together with the following constraints, constitutes the importance-weighted SVM classifier:
y_i(<w, d_i> + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ L
where w is the normal vector of the separating hyperplane, ‖w‖ is its norm, C is the penalty parameter, d_i is the feature vector extracted from the preprocessed training sample s_i^tr, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), …, (d_L, y_L), β_i is the importance weight of the training sample point (d_i, y_i), and ξ_i is the slack variable of the training sample point (d_i, y_i);
Step 7: Perform speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier built in step 6.
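Steps 6-7 amount to a soft-margin SVM whose slack penalty for sample i is scaled by β_i. A minimal sketch with synthetic stand-in data follows; the subgradient-descent trainer, learning rate and epoch count are illustrative assumptions, not the patent's solver:

```python
import numpy as np

def train_weighted_svm(d, y, beta, C=1.0, lr=1e-3, epochs=500):
    """Subgradient descent on the importance-weighted SVM objective
    0.5*||w||^2 + C * sum_i beta_i * max(0, 1 - y_i*(w.d_i + b))."""
    w, b = np.zeros(d.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (d @ w + b)
        active = margins < 1.0                       # samples with nonzero slack
        grad_w = w - C * ((beta * y * active) @ d)   # hinge subgradient, beta-weighted
        grad_b = -C * np.sum(beta * y * active)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(2)
# synthetic stand-ins for statement-level feature vectors d_i and labels y_i
d = np.vstack([rng.normal(-1.0, 0.8, size=(60, 4)),
               rng.normal(+1.0, 0.8, size=(60, 4))])
y = np.array([-1.0] * 60 + [+1.0] * 60)
beta = rng.uniform(0.5, 2.0, size=len(d))            # importance weights from step 5
w, b = train_weighted_svm(d, y, beta)
acc = float(np.mean(np.sign(d @ w + b) == y))
```

Off-the-shelf implementations achieve the same weighting; for example, scikit-learn's SVC accepts a per-sample `sample_weight` in `fit`, which multiplies the penalty C in exactly the role β_i plays here.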
Further, in the method of the invention, the preprocessing in step 1 comprises the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X to obtain the pre-emphasised speech signal X̃:
X̃(ñ) = X(ñ) − μ X(ñ − 1), X(−1) = 0,
where ñ = 0, 1, …, Ñ − 1 indexes the discrete points of the digital speech signal X, Ñ is the length of X, X(ñ) and X(ñ − 1) are the values of X at the ñ-th and (ñ−1)-th discrete points, X̃(ñ) is the value of the pre-emphasised signal at the ñ-th discrete point, and μ is the pre-emphasis coefficient;
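The pre-emphasis step above can be sketched as follows; the filter form x̃(n) = x(n) − a·x(n−1) with x(−1) = 0 follows the text, while the coefficient value a = 0.97 is an assumed conventional choice, not taken from the patent:

```python
import numpy as np

def preemphasis(x, a=0.97):
    """x_tilde(n) = x(n) - a * x(n-1), with x(-1) = 0.
    The coefficient a = 0.97 is an assumption (a common default)."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])
y = preemphasis(x)    # approximately [1.0, 0.03, 0.03, 0.03]
```

Pre-emphasis boosts the high-frequency content of the frame before framing and windowing, which flattens the spectral tilt of voiced speech.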
Step 1.2: Frame the pre-emphasised speech signal X̃ using overlapping segmentation. The distance between the starting points of consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at the sampling rate F_s = 16 kHz, and each frame length is 16 ms, i.e. 256 points. Framing X̃ yields the speech frame set {x̃_{k′}}, 1 ≤ k′ ≤ K′, in which the n-th discrete point of the k′-th speech frame is
x̃_{k′}(n) = X̃(128(k′ − 1) + n), 0 ≤ n ≤ 255,
where x̃_{k′} is the k′-th speech frame of the set, n is the discrete-point index within a frame, k′ is the frame index, and K′ is the total number of speech frames, which satisfies
K′ = ⌊(Ñ − 256)/128⌋ + 1,
where ⌊·⌋ denotes rounding down;
Step 1.3: Window each speech frame x̃_{k′}, 1 ≤ k′ ≤ K′, with a Hamming window w of length 256 points, obtaining the windowed speech frame x_{k′}:
x_{k′}(n) = x̃_{k′}(n) w(n), 0 ≤ n ≤ 255,
where x_{k′}(n), x̃_{k′}(n) and w(n) are the values of x_{k′}, x̃_{k′} and w at the n-th discrete point, and the Hamming window function of length 256 points is
w(n) = 0.54 − 0.46 cos(2πn/255), 0 ≤ n ≤ 255;
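Steps 1.2-1.3 (overlapping framing at a 128-point shift with 256-point frames, then Hamming windowing) can be sketched as:

```python
import numpy as np

FRAME_LEN = 256    # 16 ms at Fs = 16 kHz
FRAME_SHIFT = 128  # 8 ms at Fs = 16 kHz

def frame_and_window(x):
    """Split x into overlapping frames and apply a 256-point Hamming window.
    Frame count K' = floor((len(x) - 256) / 128) + 1."""
    n_frames = (len(x) - FRAME_LEN) // FRAME_SHIFT + 1
    w = np.hamming(FRAME_LEN)  # 0.54 - 0.46 * cos(2 pi n / 255)
    frames = np.stack([x[k * FRAME_SHIFT: k * FRAME_SHIFT + FRAME_LEN]
                       for k in range(n_frames)])
    return frames * w          # windowed frames, shape (K', 256)

x = np.random.default_rng(3).normal(size=16000)   # 1 s of audio at 16 kHz
frames = frame_and_window(x)                      # shape (124, 256)
```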
Step 1.4: For each windowed speech frame x_{k′}, 1 ≤ k′ ≤ K′, compute the short-time energy E_{k′} and the short-time zero-crossing rate Z_{k′}:
E_{k′} = Σ_{n=0}^{255} x_{k′}(n)², Z_{k′} = (1/2) Σ_{n=1}^{255} |sgn[x_{k′}(n)] − sgn[x_{k′}(n − 1)]|,
where E_{k′} is the short-time energy of the windowed speech frame x_{k′}, Z_{k′} is its short-time zero-crossing rate, x_{k′}(n) and x_{k′}(n − 1) are the values of x_{k′} at the n-th and (n−1)-th sampling points, and sgn[x_{k′}(n)], sgn[x_{k′}(n − 1)] are their sign-function values, with
sgn(λ) = 1 for λ ≥ 0 and −1 for λ < 0,
where λ is the argument of the sign function;
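The short-time energy and zero-crossing rate of step 1.4 can be sketched as:

```python
import numpy as np

def short_time_energy(frame):
    """E = sum_n x(n)^2 over the frame."""
    return float(np.sum(frame ** 2))

def short_time_zcr(frame):
    """Z = 0.5 * sum_n |sgn(x(n)) - sgn(x(n-1))|, with sgn(v) = 1 if v >= 0 else -1."""
    s = np.where(frame >= 0, 1, -1)
    return 0.5 * float(np.sum(np.abs(s[1:] - s[:-1])))

frame = np.array([1.0, -1.0, 1.0, -1.0])   # alternates sign every sample
E = short_time_energy(frame)               # -> 4.0
Z = short_time_zcr(frame)                  # -> 3.0 (three sign changes)
```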
Step 1.5: Determine the short-time energy threshold t_E and the short-time zero-crossing rate threshold t_Z from the statistics of E_{k′} and Z_{k′} over all K′ speech frames, where K′ is the total number of speech frames;
Step 1.6: For the windowed speech frames, first make a first-level decision using the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold t_E as a first-level valid speech frame, take the first-level valid frame with the smallest frame index as the start frame of the current valid speech segment, and the one with the largest frame index as the end frame of the current valid speech segment.
Then make a second-level decision using the short-time zero-crossing rate: for the current valid speech segment, starting from the start frame and moving frame by frame in decreasing frame order, mark every windowed speech frame whose zero-crossing rate exceeds the threshold t_Z as a valid speech frame; likewise, starting from the end frame and moving frame by frame in increasing frame order, mark every windowed speech frame whose zero-crossing rate exceeds t_Z as a valid speech frame.
The set of valid speech frames obtained after the two-level decision is denoted {p_k}, 1 ≤ k ≤ K, where k is the valid-frame index, K is the total number of valid speech frames, and p_k is the k-th valid speech frame of the set.
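The two-level decision of step 1.6 can be sketched as follows; treating the zero-crossing extension as an outward walk that stops at the first frame at or below t_Z is an interpretive assumption about the frame-by-frame marking described above:

```python
import numpy as np

def endpoint_detect(energies, zcrs, t_E, t_Z):
    """Two-level endpoint detection: energy picks a coarse active region;
    the zero-crossing rate then extends it outward at both ends."""
    energies, zcrs = np.asarray(energies), np.asarray(zcrs)
    above = np.flatnonzero(energies > t_E)       # first-level valid frames
    if len(above) == 0:
        return None                              # no speech found
    start, end = int(above[0]), int(above[-1])
    # second level: walk outward from the coarse endpoints while ZCR stays high
    while start > 0 and zcrs[start - 1] > t_Z:
        start -= 1
    while end < len(zcrs) - 1 and zcrs[end + 1] > t_Z:
        end += 1
    return start, end

E = [0.1, 0.2, 5.0, 6.0, 5.5, 0.2, 0.1]
Z = [0.0, 9.0, 3.0, 3.0, 3.0, 9.0, 0.0]
span = endpoint_detect(E, Z, t_E=1.0, t_Z=5.0)   # -> (1, 5)
```

The energy stage finds the loud voiced core; the zero-crossing stage recovers weak fricative onsets and tails that have low energy but high ZCR.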
Further, in the method of the invention, the feature vector d_i in step 5 is extracted as follows:
The frame-level short-time features and their first-order and second-order differences serve as low-level descriptors, and statement-level features d_i are obtained by computing statistics of these low-level descriptors over the whole utterance.
That is, the short-time features of the speech frames (such as the fundamental frequency, frame energy, mel-frequency cepstral coefficients, and the wavelet-packet cepstral coefficient features proposed herein) act as low-level descriptors (Low Level Descriptors, LLD), and the statement-level feature parameters are obtained by statistical computation over all short-time features of an utterance.
The statistics commonly used in speech emotion feature extraction are listed in Table 1 below:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 mel-frequency bands, 13th-order mel cepstral coefficients, the positions of the maximum and minimum of the mel correlation spectrum, and the 90%, 75%, 50% and 25% roll-off points of the mel correlation spectrum.
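The mapping from frame-level low-level descriptors to a statement-level feature vector can be sketched as follows; since the body of Table 1 is not reproduced here, the statistic set (mean, standard deviation, maximum, minimum, median) is a representative assumption rather than the patent's exact list:

```python
import numpy as np

def statement_features(lld):
    """Map a (K, F) matrix of frame-level LLDs (K frames, F descriptors) to one
    statement-level vector: statistics of each LLD and of its 1st/2nd differences."""
    def stats(m):
        return np.concatenate([m.mean(axis=0), m.std(axis=0),
                               m.max(axis=0), m.min(axis=0),
                               np.median(m, axis=0)])
    d1 = np.diff(lld, axis=0)            # first-order difference of each LLD
    d2 = np.diff(lld, n=2, axis=0)       # second-order difference of each LLD
    return np.concatenate([stats(lld), stats(d1), stats(d2)])

lld = np.random.default_rng(4).normal(size=(120, 3))  # e.g. 120 frames, 3 LLDs
d_i = statement_features(lld)                         # 5 stats x 3 LLDs x 3 matrices = 45
```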
Beneficial effects: Compared with the prior art, the present invention has the following advantages:
Existing speech emotion recognition methods do not account for the covariate shift that exists between training samples and test samples in practical applications, so that the performance of speech emotion recognition in real deployments is worse than under laboratory conditions. The present invention builds an importance weight coefficient model that explicitly quantifies the differences between the test samples and training samples encountered in practice, i.e. the covariate shift between them; the computed importance weight coefficients β are this quantified value and directly express the deviation between training and test samples. During subsequent speech emotion feature extraction and classifier construction, the covariate-shift quantification β is used to compensate for the deviation, thereby largely eliminating the influence on speech emotion recognition of deviations caused by the recording environment. Compared with other deviation-compensating speech emotion recognition methods, building the importance weight coefficient model to quantify the deviation between training and test samples reduces the computational complexity and difficulty of covariate-shift compensation.
On the basis of the importance weight coefficient model, the deviation between training and test samples is compensated in the SVM classifier by introducing the importance weight coefficients. Compared with other SVM-based recognition methods, this method introduces the importance weights into the objective function of the classical SVM classifier, which is equivalent to using a non-fixed penalty factor: the penalty coefficient is increased for data with large weights, so that the separating hyperplane is adjusted accordingly. This reduces the influence of environmental factors on speech emotion recognition, improves the accuracy and stability of speech emotion recognition in practical applications, and achieves better classification performance than standard SVMs.
Description of the drawings
Fig. 1 is the training flow chart of the importance-weighted SVM of the present invention.
Fig. 2 is the importance weight computation flow chart of the present invention.
Specific embodiment
The present invention is further illustrated below with reference to the embodiments and the accompanying drawings.
The speech emotion recognition method based on an importance-weighted support vector machine classifier of the present invention comprises the following steps:
Step 1: Preprocess the input sample set to obtain the preprocessed training sample set {s_i^tr} and test sample set {s_j^te}, and randomly select b template points {c_l} from the preprocessed test sample set, where s_i^tr is a sample of the preprocessed training set, s_j^te is a sample of the preprocessed test set, n_tr is the number of training samples, n_te is the number of test samples, c_l is a template point randomly selected from {s_j^te}, i indexes the training samples, j indexes the test samples, and l indexes the randomly selected template points.
The preprocessing specifically comprises the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X to obtain the pre-emphasised speech signal X̃:
X̃(ñ) = X(ñ) − μ X(ñ − 1), X(−1) = 0,
where ñ indexes the discrete points of the digital speech signal X, Ñ is the length of X, X(ñ) and X(ñ − 1) are the values of X at the ñ-th and (ñ−1)-th discrete points, X̃(ñ) is the value of the pre-emphasised signal at the ñ-th discrete point, and μ is the pre-emphasis coefficient;
Step 1.2: Frame the pre-emphasised speech signal X̃ using overlapping segmentation. The distance between the starting points of consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at the sampling rate F_s = 16 kHz, and each frame length is 16 ms, i.e. 256 points. Framing X̃ yields the speech frame set {x̃_{k′}}, 1 ≤ k′ ≤ K′, where x̃_{k′} is the k′-th speech frame, n is the discrete-point index within a frame, k′ is the frame index, and the total number of frames K′ satisfies K′ = ⌊(Ñ − 256)/128⌋ + 1, with ⌊·⌋ denoting rounding down;
Step 1.3: Window each speech frame x̃_{k′}, 1 ≤ k′ ≤ K′, with a Hamming window w of length 256 points, obtaining the windowed speech frame x_{k′}(n) = x̃_{k′}(n) w(n), where x_{k′}(n), x̃_{k′}(n) and w(n) are the values of x_{k′}, x̃_{k′} and w at the n-th discrete point, and the Hamming window function of length 256 points is w(n) = 0.54 − 0.46 cos(2πn/255), 0 ≤ n ≤ 255.
Endpoint detection is then completed using the well-known two-threshold energy/zero-crossing-rate decision method, as follows:
Step 1.4: For each windowed speech frame x_{k′}, 1 ≤ k′ ≤ K′, compute the short-time energy E_{k′} = Σ_n x_{k′}(n)² and the short-time zero-crossing rate Z_{k′} = (1/2) Σ_n |sgn[x_{k′}(n)] − sgn[x_{k′}(n − 1)]|, where x_{k′}(n) and x_{k′}(n − 1) are the values of x_{k′} at the n-th and (n−1)-th sampling points, and sgn[·] is the sign function taking the value 1 for non-negative arguments and −1 otherwise;
Step 1.5: Determine the short-time energy threshold t_E and the short-time zero-crossing rate threshold t_Z from the statistics of E_{k′} and Z_{k′} over all K′ speech frames;
Step 1.6: For the windowed speech frames, first make a first-level decision using the short-time energy: mark every frame whose short-time energy exceeds t_E as a first-level valid speech frame, take the first-level valid frame with the smallest frame index as the start frame of the current valid speech segment and the one with the largest frame index as its end frame. Then make a second-level decision using the short-time zero-crossing rate: starting from the start frame and moving frame by frame in decreasing frame order, mark every frame whose zero-crossing rate exceeds t_Z as a valid speech frame, and starting from the end frame and moving frame by frame in increasing frame order, likewise mark every frame whose zero-crossing rate exceeds t_Z as a valid speech frame. The set of valid speech frames obtained after the two-level decision is denoted {s_k}, 1 ≤ k ≤ K, where k is the valid-frame index, K is the total number of valid speech frames, and s_k is the k-th valid speech frame of the set.
Step 2: Compute the optimal Gaussian kernel width σ̂ of the basis functions.
The closeness of the distribution of the training sample data to that of the test sample data can be expressed by the importance weight β(s):
β(s) = p_te(s) / p_tr(s),
where p_tr(s) is the distribution density of the preprocessed training sample set {s_i^tr} and p_te(s) is the distribution density of the preprocessed test sample set {s_j^te}.
Step 2.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, …, 1.
Step 2.2: Compute the pre-compensation parameter vector α:
Model β(s) by the linear model
β̂(s) = Σ_{l=1}^{b} α_l φ_l(s),
where α = (α_1, α_2, …, α_b)′, φ_l(s) is a basis function, s ∈ D, l = 1, 2, …, b, and b and φ_l can be determined from the samples {s_i^tr} and {s_j^te}.
Compute the squared-error objective J_0(α):
J_0(α) = (1/2) ∫ (β̂(s) − β(s))² p_tr(s) ds = (1/2) ∫ β̂(s)² p_tr(s) ds − ∫ β̂(s) p_te(s) ds + const.
The last term of the above expression is a constant and can be ignored; the first two terms are denoted J(α):
J(α) = (1/2) α′ H α − h′ α,
where α′ is the transpose of the vector α, H is the b × b matrix with elements H_{l,l′} = ∫ φ_l(s) φ_{l′}(s) p_tr(s) ds, and h is the b-dimensional vector with elements h_l = ∫ φ_l(s) p_te(s) ds.
Approximating these expectations by sample averages gives the approximated expected squared error of the importance weights,
Ĵ(α) = (1/2) α′ Ĥ α − ĥ′ α,
where Ĥ is the b × b matrix with elements Ĥ_{l,l′} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(s_i^tr) φ_{l′}(s_i^tr), ĥ is the b-dimensional vector with elements ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(s_j^te), and ĥ′ is the transpose of ĥ.
Taking into account the non-negativity of the importance weight β(s), this is converted into the optimization problem
min_α Ĵ(α), subject to the constraint α ≥ 0.
The parameter vector α is the optimal solution of this programming problem.
When computing Ĥ and ĥ, φ_l is taken to be a Gaussian kernel function of width σ,
φ_l(s) = exp(−‖s − c_l‖² / (2σ²)),
and substituting φ_l into Ĥ and ĥ gives
Ĥ_{l,l′} = (1/n_tr) Σ_{i=1}^{n_tr} exp(−(‖s_i^tr − c_l‖² + ‖s_i^tr − c_{l′}‖²)/(2σ²)), ĥ_l = (1/n_te) Σ_{j=1}^{n_te} exp(−‖s_j^te − c_l‖²/(2σ²)),
where Ĥ_{l,l′} is an element of Ĥ, ĥ_l is an element of ĥ, l, l′ = 1, 2, …, b, c_{l′} is a template point randomly selected from {s_j^te}, l′ indexes the selected template points, and σ takes each of the preset values in turn.
Step 2.3: Select the optimal basis-function Gaussian kernel width σ̂ by cross-validation:
Divide the preprocessed training sample set {s_i^tr} and test sample set {s_j^te} into R subsets {S_r^tr} and {S_r^te} respectively, and compute
Ĵ^(r) = (1 / (2|S_r^tr|)) Σ_{s^tr ∈ S_r^tr} β̂(s^tr)² − (1 / |S_r^te|) Σ_{s^te ∈ S_r^te} β̂(s^te),
where Ĵ^(r) is the approximated expected squared error of the importance weights on the r-th fold of the cross-validation, r = 1, 2, …, R, S_r^tr is the r-th training subset, S_r^te is the r-th test subset, |S_r^tr| and |S_r^te| are their sample counts, s^tr is a sample of S_r^tr, s^te is a sample of S_r^te, and β̂(s^tr) and β̂(s^te) are the importance weight estimates of s^tr and s^te.
The cross-validated approximated expected squared error of the importance weights is then
Ĵ_CV = (1/R) Σ_{r=1}^{R} Ĵ^(r), r = 1, 2, …, R.
Minimising Ĵ_CV over the preset values σ = 0.1, 0.2, …, 1 yields the optimal solution σ̂, which is taken as the optimal basis-function Gaussian kernel width.
Step 3: Compute the optimal parameter vector α̂:
Substitute the Gaussian basis functions obtained in step 2 and the optimal basis-function Gaussian kernel width σ̂ into the computation of Ĥ and ĥ:
Ĥ_{l,l′} = (1/n_tr) Σ_{i=1}^{n_tr} exp(−(‖s_i^tr − c_l‖² + ‖s_i^tr − c_{l′}‖²)/(2σ̂²)), ĥ_l = (1/n_te) Σ_{j=1}^{n_te} exp(−‖s_j^te − c_l‖²/(2σ̂²)),
where l, l′ = 1, 2, …, b; then solve the optimization problem min_α Ĵ(α) = (1/2) α′ Ĥ α − ĥ′ α subject to the constraint α ≥ 0, whose solution is the optimal parameter vector α̂.
Step 4: Compute the approximated importance weights:
From step 2, β(s) is modelled by the linear model β̂(s) = Σ_l α̂_l φ_l(s); substituting the Gaussian basis functions gives
β̂(s) = Σ_{l=1}^{b} α̂_l exp(−‖s − c_l‖² / (2σ̂²)),
where α̂_l is an element of the vector α̂, and s is a sample of the set D of all training and test sample points, s ∈ D.
Step 5: Build the importance-weighted SVM classifier model:
Add the importance weights as coefficients of the slack variables ξ of the standard SVM classifier:
min_{w,ξ} (1/2) ‖w‖² + C Σ_{i=1}^{L} β_i ξ_i (13)
subject to the constraints y_i(<w, d_i> + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ L, where w is the normal vector of the separating hyperplane, ‖w‖ is its norm, ξ is the slack variable, C is the penalty parameter, d_i is the feature vector extracted from the training sample s_i^tr, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), …, (d_L, y_L), and β_i is the importance weight of the training sample point (d_i, y_i).
The short-time features of the speech frames (such as the fundamental frequency, frame energy, mel-frequency cepstral coefficients, and the wavelet-packet cepstral coefficient features proposed herein) act as low-level descriptors (Low Level Descriptors, LLD), and the statement-level feature parameters of an utterance are obtained by statistical computation over all its short-time features.
The statistics commonly used in speech emotion feature extraction are listed in Table 1 below:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 mel-frequency bands, 13th-order mel cepstral coefficients, the positions of the maximum and minimum of the mel correlation spectrum, and the 90%, 75%, 50% and 25% roll-off points of the mel correlation spectrum. Formula (13) together with its constraints constitutes the importance-weighted SVM classifier model.
The above embodiment is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and equivalent substitutions can be made without departing from the principles of the present invention, and the technical schemes obtained by such improvements and equivalent substitutions of the claims of the present invention all fall within the protection scope of the present invention.
Claims (3)
1. A speech emotion recognition method based on an importance-weighted support vector machine classifier, characterised in that the method comprises the following steps:
Step 1: Preprocess the input speech signal and extract the feature vectors d_i;
Step 2: Divide the input sample set into a training sample set {s_i^tr}, 1 ≤ i ≤ n_tr, and a test sample set {s_j^te}, 1 ≤ j ≤ n_te, and randomly select b template points c_l, 1 ≤ l ≤ b, from the test sample set, where s_i^tr is a sample in the training set, s_j^te is a sample in the test set, n_tr is the number of training samples, n_te is the number of test samples, i indexes the training samples, j indexes the test samples, and l indexes the randomly selected template points;
Step 3: Compute the optimal Gaussian kernel width σ̂ of the basis functions. The specific flow is as follows:
Step 3.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, …, 1;
Step 3.2: Compute the pre-compensation parameter vector α according to the following flow:
Step 3.2.1: Build the b × b matrix Ĥ with elements Ĥ_{l,l′} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(s_i^tr) φ_{l′}(s_i^tr), l, l′ = 1, 2, …, b, where φ_l(s) = exp(−‖s − c_l‖²/(2σ²)) is the Gaussian basis function centred at the template point c_l, c_{l′} is a point of the randomly selected test sample subset {c_l}, and l′ also indexes the selected template points;
Step 3.2.2: Build the b-dimensional vector ĥ with elements ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(s_j^te);
Step 3.2.3: Compute the pre-compensation parameter vector α: with α ≥ 0 as the constraint, solve the optimization problem min_α Ĵ(α), i.e. find the value of the parameter vector α that minimises Ĵ(α) = (1/2) α′ Ĥ α − ĥ′ α, where Ĵ(α) is the approximated expected squared error of the importance weights, α′ is the transpose of the vector α, and ĥ′ is the transpose of the vector ĥ;
Step 3.3: Determine the optimal Gaussian kernel width σ̂ of the basis functions by cross-validation:
Divide the training sample set {x^tr_i} and the test sample set {x^te_j} into R subsets {X^r_tr} and {X^r_te} respectively, and calculate the expected approximation variance of the importance weights for the r-th fold of the cross-validation according to the following formula:
Ĵ_r = (1/(2 n^r_tr)) Σ_{s_tr ∈ X^r_tr} β̂(s_tr)² − (1/n^r_te) Σ_{s_te ∈ X^r_te} β̂(s_te)
where Ĵ_r is the expected approximation variance of the importance weights for the r-th fold, r = 1, 2, ..., R, X^r_tr is the r-th training sample subset, X^r_te is the r-th test sample subset, n^r_tr is the number of samples in X^r_tr, n^r_te is the number of samples in X^r_te, s_tr is a sample in X^r_tr, s_te is a sample in X^r_te, and β̂(s_tr) and β̂(s_te) are the importance-weight estimates of the samples s_tr and s_te, computed as follows:
β̂(s) = Σ_{l=1..b} α_l exp(−‖s − c_l‖²/(2σ²))
where α_l is the l-th element of the pre-estimated parameter vector α calculated in step 3.2.3;
Substitute each of the ten preset σ values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to calculate the expected approximation variance of the importance weights under cross-validation, Ĵ_CV, and take the σ corresponding to the minimum Ĵ_CV as the optimal Gaussian kernel width σ̂ of the basis functions:
Ĵ_CV = (1/R) Σ_{r=1..R} Ĵ_r
where r = 1, 2, ..., R;
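The fold-wise quantity of step 3.3 can be evaluated as below, under the assumption that it is the squared-error criterion of the least-squares importance fit; `beta`, `j_hat` and the argument names are hypothetical, and in the full procedure `j_hat` would be averaged over the R folds for each candidate σ.

```python
import math

def beta(s, alpha, centers, sigma):
    # importance-weight estimate: sum of alpha_l times a Gaussian basis evaluated at s
    return sum(a * math.exp(-((s - c) ** 2) / (2 * sigma ** 2))
               for a, c in zip(alpha, centers))

def j_hat(x_tr, x_te, alpha, centers, sigma):
    # empirical criterion for one fold:
    # half the mean of beta^2 over training samples minus the mean of beta over test samples
    term_tr = sum(beta(x, alpha, centers, sigma) ** 2 for x in x_tr) / (2 * len(x_tr))
    term_te = sum(beta(x, alpha, centers, sigma) for x in x_te) / len(x_te)
    return term_tr - term_te
```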
Step 4: With α ≥ 0 as the constraint, solve the optimization problem min_α (1/2) α′ Ĥ* α − ĥ*′ α to obtain the optimal parameter vector α̂, where Ĥ* and ĥ* are the matrix of step 3.2.1 and the vector of step 3.2.2 recomputed with the optimal kernel width σ̂, i.e.
Ĥ*(l, l′) = (1/n_tr) Σ_{i=1..n_tr} exp(−‖x^tr_i − c_l‖²/(2σ̂²)) · exp(−‖x^tr_i − c_l′‖²/(2σ̂²)), ĥ*(l) = (1/n_te) Σ_{j=1..n_te} exp(−‖x^te_j − c_l‖²/(2σ̂²)), l, l′ = 1, 2, ..., b,
where Ĥ*(l, l′) is the element in row l and column l′ of the matrix Ĥ* and ĥ*(l) is the l-th element of the vector ĥ*;
Step 5: Calculate the importance weight β(s) by the following formula:
β(s) = Σ_{l=1..b} α̂_l exp(−‖s − c_l‖²/(2σ̂²))
where α̂_l is the l-th element of the optimal parameter vector α̂, s is a sample among the training and test sample points, and s ∈ D, where D is the set of training and test sample points;
Step 6: Establish the importance-weight SVM classifier:
Add the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier, obtaining the following SVM classifier expression:
min over w and b of (1/2)‖w‖² + C Σ_{i=1..n_tr} β_i ξ_i
This SVM classifier expression, together with the following constraints, constitutes the importance-weight SVM classifier:
y_i(⟨w, d_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ n_tr
where w is the normal vector of the separating hyperplane, ‖w‖ is the norm of w, C is the penalty parameter, d_i is the feature vector extracted from the preprocessed training sample x^tr_i, and y_i ∈ {+1, −1} is the class label; together they constitute the training samples (d_1, y_1), (d_2, y_2), ..., (d_{n_tr}, y_{n_tr}); β_i is the importance weight of the training sample point (d_i, y_i), and ξ_i is the slack variable of the training sample point (d_i, y_i);
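As a rough illustration of step 6, the sketch below trains a linear soft-margin SVM whose hinge losses are scaled by per-sample importance weights β_i. The subgradient solver, the name `train_weighted_svm`, and all hyperparameters are assumptions of this sketch, not the patent's training procedure.

```python
def train_weighted_svm(X, y, beta, C=1.0, lr=0.01, epochs=200):
    """Importance-weighted soft-margin linear SVM via subgradient descent.
    Objective: 0.5*||w||^2 + C * sum_i beta_i * max(0, 1 - y_i*(w.x_i + b))."""
    d = len(X[0])
    w = [0.0] * d
    bias = 0.0
    for _ in range(epochs):
        gw = list(w)          # gradient of the regulariser 0.5*||w||^2 is w itself
        gb = 0.0
        for xi, yi, bi in zip(X, y, beta):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + bias)
            if margin < 1:    # hinge-loss subgradient, scaled by the importance weight
                for j in range(d):
                    gw[j] -= C * bi * yi * xi[j]
                gb -= C * bi * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        bias -= lr * gb
    return w, bias
```

With all β_i equal to 1 this reduces to a plain soft-margin SVM; larger β_i make misclassifying sample i more costly, which is exactly the role the weights play in the claim.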
Step 7: Carry out speech emotion recognition using the feature vectors extracted in step 1 and the importance-weight SVM classifier established in step 6.
2. The speech emotion recognition method based on an importance-weight support vector machine classifier according to claim 1, characterized in that the preprocessing in step 1 comprises the following steps:
Step 1.1: Perform pre-emphasis on the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal X̄:
X̄(m) = X(m) − a·X(m−1), 0 ≤ m ≤ N̄ − 1
where m denotes the discrete-point index of the digital speech signal X, N̄ is the length of X, a is the pre-emphasis coefficient, X(m) and X(m−1) denote the values of X at the m-th and (m−1)-th discrete points respectively, X̄(m) denotes the value of the pre-emphasized speech signal at the m-th discrete point, and X(−1) = 0;
Step 1.2: Frame the pre-emphasized speech signal X̄ using the overlapping segmentation method. The distance between the starting point of a frame and the starting point of the following frame is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at the sampling rate F_s = 16 kHz, and each frame length is 16 ms, i.e. 256 points. Framing X̄ yields the speech frame set {f_k′}; the data of the n-th discrete point of the k′-th speech frame in the speech frame set is:
f_k′(n) = X̄((k′ − 1)·128 + n), 0 ≤ n ≤ 255
where f_k′ is the k′-th speech frame in the speech frame set, n denotes the discrete-point index within a frame, k′ is the speech frame number, and K′ is the total number of speech frames, which satisfies:
K′ = ⌊(N̄ − 256)/128⌋ + 1
where ⌊·⌋ denotes rounding down;
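The framing of step 1.2 (256-point frames, 128-point shift) can be sketched as follows; `frame_signal` is an assumed helper name.

```python
def frame_signal(x, frame_len=256, shift=128):
    # number of full frames: floor((len(x) - frame_len) / shift) + 1
    n_frames = max(0, (len(x) - frame_len) // shift + 1)
    return [x[k * shift : k * shift + frame_len] for k in range(n_frames)]
```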
Step 1.3: Apply a 256-point Hamming window w to each speech frame f_k′, 1 ≤ k′ ≤ K′, to obtain the windowed speech frame x_k′:
x_k′(n) = f_k′(n)·w(n), 0 ≤ n ≤ 255
where x_k′(n), f_k′(n) and w(n) denote the values of x_k′, f_k′ and w at the n-th discrete point respectively, and the 256-point Hamming window function is:
w(n) = 0.54 − 0.46·cos(2πn/255), 0 ≤ n ≤ 255;
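A minimal sketch of the 256-point Hamming window named in step 1.3, using the standard definition w(n) = 0.54 − 0.46·cos(2πn/(N−1)):

```python
import math

def hamming(N=256):
    # standard Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
```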
Step 1.4: For each windowed speech frame x_k′, 1 ≤ k′ ≤ K′, calculate the short-time energy E_k′ and the short-time zero-crossing rate Z_k′:
E_k′ = Σ_{n=0..255} x_k′(n)²
Z_k′ = (1/2) Σ_{n=1..255} |sgn[x_k′(n)] − sgn[x_k′(n−1)]|
where E_k′ denotes the short-time energy of the windowed speech frame x_k′, Z_k′ denotes the short-time zero-crossing rate of x_k′, x_k′(n) is the value of x_k′ at the n-th sampling point, x_k′(n−1) is the value of x_k′ at the (n−1)-th sampling point, and sgn[x_k′(n)] and sgn[x_k′(n−1)] are the sign functions of x_k′(n) and x_k′(n−1), i.e.:
sgn(λ) = 1 for λ ≥ 0, and sgn(λ) = −1 for λ < 0
where λ is the argument of the sign function;
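Step 1.4's frame statistics can be sketched directly from their standard definitions (energy as the sum of squared samples, zero-crossing rate as half the summed magnitude of sign changes); the helper names are assumptions.

```python
def sgn(v):
    # sign function as in the claim: +1 for v >= 0, -1 otherwise
    return 1 if v >= 0 else -1

def short_time_energy(frame):
    # sum of squared sample values over the frame
    return sum(v * v for v in frame)

def zero_crossing_rate(frame):
    # half the summed magnitude of sign changes between adjacent samples
    return sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
               for n in range(1, len(frame))) / 2
```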
Step 1.5: Determine the short-time energy threshold t_E and the short-time zero-crossing rate threshold t_Z from the short-time energies and zero-crossing rates of all frames, where K′ is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level discrimination using the short-time energy: mark each windowed speech frame whose short-time energy exceeds the threshold t_E as a first-level valid speech frame; take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set, and the first-level valid speech frame with the largest frame number as the end frame of the current valid speech frame set.
Then make a second-level discrimination using the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in descending frame-number order, mark each windowed speech frame whose short-time zero-crossing rate exceeds the threshold t_Z as a valid speech frame; likewise, starting from the end frame and proceeding frame by frame in ascending frame-number order, mark each windowed speech frame whose short-time zero-crossing rate exceeds the threshold t_Z as a valid speech frame.
The valid speech frame set obtained after the two-level discrimination is denoted {p_k}, 1 ≤ k ≤ K, where k is the valid speech frame number, K is the total number of valid speech frames, and p_k is the k-th valid speech frame in the valid speech frame set.
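The two-level discrimination of step 1.6 can be sketched over per-frame energy and zero-crossing-rate tracks. `endpoint_detect` is an assumed name, and stopping the outward extension at the first frame whose zero-crossing rate does not exceed t_Z is one reading of the frame-by-frame discrimination.

```python
def endpoint_detect(energies, zcrs, t_e, t_z):
    # level 1: frames whose short-time energy exceeds t_E
    high = [k for k, e in enumerate(energies) if e > t_e]
    if not high:
        return []
    start, end = min(high), max(high)
    valid = set(high)
    # level 2: extend before the start frame while ZCR exceeds t_Z
    k = start - 1
    while k >= 0 and zcrs[k] > t_z:
        valid.add(k)
        k -= 1
    # ... and after the end frame while ZCR exceeds t_Z
    k = end + 1
    while k < len(zcrs) and zcrs[k] > t_z:
        valid.add(k)
        k += 1
    return sorted(valid)
```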
3. The speech emotion recognition method based on an importance-weight support vector machine classifier according to claim 1 or 2, characterized in that the feature vector d_i in step 5 is extracted as follows:
taking the frame-level short-time features and the first-order and second-order differences of those short-time features as low-level descriptors, and performing statistical computation on the low-level descriptors of an utterance to obtain the utterance-level feature d_i.
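Claim 3's utterance-level feature can be sketched as statistics computed over a frame-level feature track and its first- and second-order differences; the particular statistic set (mean, standard deviation, max, min) and the helper names are assumptions of this sketch.

```python
import math

def deltas(seq):
    # first-order difference of a frame-level feature track
    return [b - a for a, b in zip(seq, seq[1:])]

def statistics(seq):
    # mean, standard deviation, max, min over one low-level descriptor
    m = sum(seq) / len(seq)
    var = sum((v - m) ** 2 for v in seq) / len(seq)
    return [m, math.sqrt(var), max(seq), min(seq)]

def utterance_feature(track):
    d1 = deltas(track)        # first-order difference
    d2 = deltas(d1)           # second-order difference
    feats = []
    for lld in (track, d1, d2):
        feats.extend(statistics(lld))
    return feats
```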
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610969948.7A CN106504772B (en) | 2016-11-04 | 2016-11-04 | Speech-emotion recognition method based on weights of importance support vector machine classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106504772A true CN106504772A (en) | 2017-03-15 |
CN106504772B CN106504772B (en) | 2019-08-20 |
Family
ID=58322831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610969948.7A Active CN106504772B (en) | 2016-11-04 | 2016-11-04 | Speech-emotion recognition method based on weights of importance support vector machine classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106504772B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108364641A (en) * | 2018-01-09 | 2018-08-03 | 东南大学 | A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise |
CN108735233A (en) * | 2017-04-24 | 2018-11-02 | 北京理工大学 | A kind of personality recognition methods and device |
CN108831450A (en) * | 2018-03-30 | 2018-11-16 | 杭州鸟瞰智能科技股份有限公司 | A kind of virtual robot man-machine interaction method based on user emotion identification |
WO2020024210A1 (en) * | 2018-08-02 | 2020-02-06 | 深圳大学 | Method and apparatus for optimizing window parameter of integrated kernel density estimator, and terminal device |
CN110991238A (en) * | 2019-10-30 | 2020-04-10 | 中国科学院自动化研究所南京人工智能芯片创新研究院 | Speech auxiliary system based on speech emotion analysis and micro-expression recognition |
CN111415680A (en) * | 2020-03-26 | 2020-07-14 | 心图熵动科技(苏州)有限责任公司 | Method for generating anxiety prediction model based on voice and anxiety prediction system |
CN113434698A (en) * | 2021-06-30 | 2021-09-24 | 华中科技大学 | Relation extraction model establishing method based on full-hierarchy attention and application thereof |
CN116801456A (en) * | 2023-08-22 | 2023-09-22 | 深圳市创洺盛光电科技有限公司 | Intelligent control method of LED lamp |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080077720A (en) * | 2007-02-21 | 2008-08-26 | 인하대학교 산학협력단 | A voice activity detecting method based on a support vector machine(svm) using a posteriori snr, a priori snr and a predicted snr as a feature vector |
KR20110021328A (en) * | 2009-08-26 | 2011-03-04 | 인하대학교 산학협력단 | The method to improve the performance of speech/music classification for 3gpp2 codec by employing svm based on discriminative weight training |
CN102201237A (en) * | 2011-05-12 | 2011-09-28 | 浙江大学 | Emotional speaker identification method based on reliability detection of fuzzy support vector machine |
CN103544963A (en) * | 2013-11-07 | 2014-01-29 | 东南大学 | Voice emotion recognition method based on core semi-supervised discrimination and analysis |
CN104091602A (en) * | 2014-07-11 | 2014-10-08 | 电子科技大学 | Speech emotion recognition method based on fuzzy support vector machine |
Non-Patent Citations (2)
Title |
---|
YONGMING HUANG ET AL.: "Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition", 《IET SIGNAL PROCESS》 * |
秦宇强,张雪英: "基于SVM的语音信号情感识别", 《电路与系统学报》 * |
Also Published As
Publication number | Publication date |
---|---|
CN106504772B (en) | 2019-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106504772B (en) | Speech-emotion recognition method based on weights of importance support vector machine classifier | |
CN108305616B (en) | Audio scene recognition method and device based on long-time and short-time feature extraction | |
CN103345923B (en) | A kind of phrase sound method for distinguishing speek person based on rarefaction representation | |
CN109256150B (en) | Speech emotion recognition system and method based on machine learning | |
CN101599271B (en) | Recognition method of digital music emotion | |
CN110853680B (en) | double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy | |
Nwe et al. | Speech based emotion classification | |
CN107146601A (en) | A kind of rear end i vector Enhancement Methods for Speaker Recognition System | |
CN106328121A (en) | Chinese traditional musical instrument classification method based on depth confidence network | |
CN107393554A (en) | In a kind of sound scene classification merge class between standard deviation feature extracting method | |
CN102800316A (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN107146615A (en) | Audio recognition method and system based on the secondary identification of Matching Model | |
CN103236258B (en) | Based on the speech emotional characteristic extraction method that Pasteur's distance wavelet packets decomposes | |
CN111210803B (en) | System and method for training clone timbre and rhythm based on Bottle sock characteristics | |
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
CN103985381A (en) | Voice frequency indexing method based on parameter fusion optimized decision | |
CN109410911A (en) | Artificial intelligence learning method based on speech recognition | |
CN109754790A (en) | A kind of speech recognition system and method based on mixing acoustic model | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN108461085A (en) | A kind of method for distinguishing speek person under the conditions of Short Time Speech | |
CN107274887A (en) | Speaker's Further Feature Extraction method based on fusion feature MGFCC | |
CN105070300A (en) | Voice emotion characteristic selection method based on speaker standardization change | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN108364641A (en) | A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise | |
CN110047504A (en) | Method for distinguishing speek person under identity vector x-vector linear transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||