CN106504772A - Speech-emotion recognition method based on weights of importance support vector machine classifier - Google Patents

Speech-emotion recognition method based on weights of importance support vector machine classifier

Info

Publication number
CN106504772A
CN106504772A
Authority
CN
China
Prior art keywords
frame
sample
weights
importance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610969948.7A
Other languages
Chinese (zh)
Other versions
CN106504772B (en)
Inventor
黄永明
吴奥
章国宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610969948.7A priority Critical patent/CN106504772B/en
Publication of CN106504772A publication Critical patent/CN106504772A/en
Application granted granted Critical
Publication of CN106504772B publication Critical patent/CN106504772B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Abstract

The invention discloses a speech emotion recognition method based on an importance-weighted support vector machine (SVM) classifier, comprising the quantification of the deviation between training samples and test samples, the establishment of an importance-weight coefficient model, and the construction of an SVM based on the importance-weight coefficients. The method quantifies the deviation between training samples and test samples on the basis of the importance-weight coefficients, so that the deviation can be compensated at the classifier level. By building an importance-weight coefficient model for the deviation between training samples and test samples in emotion classification, quantifying that deviation in the speech samples, and adjusting the separating hyperplane of an SVM classifier based on the importance-weight model, the invention compensates the deviation at the classifier level and improves the accuracy and stability of speech emotion recognition.

Description

Speech-emotion recognition method based on weights of importance support vector machine classifier
Technical field
The present invention relates to a speech emotion recognition method based on an importance-weighted support vector machine classifier, and belongs to the technical field of speech emotion recognition.
Background technology
With the rapid development of information technology and the rise of various intelligent terminals, existing human-computer interaction systems face increasingly severe tests. To overcome the obstacles of human-computer interaction and make it more convenient and natural, the emotional intelligence of machines has received growing attention from researchers in many fields. Speech, as a highly efficient interactive medium in modern human-computer interaction, carries rich emotional information. Speech emotion recognition, as an important research topic of emotional intelligence, has broad application prospects in distance education, assisted lie detection, automated remote telephone service centers, clinical medicine, intelligent toys, and smartphones, and has attracted extensive attention from more and more research institutions and researchers.
In speech emotion recognition, the acquisition time and recording environment of the training samples differ from those of the test samples, so a covariate shift exists between the training samples and the test samples. To improve the precision and robustness of speech emotion recognition, compensating for this deviation is essential. Excluding the deviation introduced by the recording environment, removing redundant information that is unrelated to emotion (such as speaker and spoken-content information) from the raw speech data, and extracting effective emotional information are the key points and difficulties in improving the robustness of speech emotion recognition systems.
As an emerging speech technology, the importance-weight coefficient model has drawn increasing attention from researchers because of its flexibility and effectiveness in speech signal processing. For classification problems, the deviation between training samples and test samples can be quantified on the basis of the importance-weight coefficients, so that the deviation can be compensated at the classifier level; this reduces the influence of environmental factors on speech emotion recognition and improves its accuracy and stability. Compensating, at the classifier level, for the covariate shift that exists between training samples and test samples is therefore of great significance for speech emotion recognition research.
Content of the invention
Technical problem: The present invention provides a speech emotion recognition method based on an importance-weighted support vector machine classifier that improves the robustness of speech emotion recognition by compensating, at the classifier level, for the covariate shift between training samples and test samples. The method reduces the influence of irrelevant information such as the recording environment and the speaker on recognition, and improves the precision and robustness of speech emotion recognition.
Technical scheme: The speech emotion recognition method based on an importance-weighted support vector machine classifier of the present invention comprises the following steps:
Step 1: Preprocess the input speech signal and extract the feature vector $d_i$;
Step 2: Divide the input sample set into a training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and a test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$, and randomly select b template points $c_l$ from the test sample set to form $\{c_l\}_{l=1}^{b}$, where $s_i^{tr}$ is a sample in the training sample set, $s_j^{te}$ is a sample in the test sample set, $n_{tr}$ is the number of training samples, $n_{te}$ is the number of test samples, i is the training-sample index, j is the test-sample index, and l is the index of a randomly selected template point;
Step 3: Compute the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions. The detailed procedure is as follows:
Step 3.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 3.2: Compute the pre-compensation parameter vector α according to the following procedure:
Step 3.2.1: Compute $\hat{H}_{l,l'}$ according to the following formula and build the b × b matrix $\hat{H}$ with $\hat{H}_{l,l'}$ as its elements:

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right) \qquad (1)$$

where $\hat{H}$ is a b × b matrix, $\hat{H}_{l,l'}$ is its element, l, l' = 1, 2, ..., b, $c_{l'}$ is a point of the randomly selected template set $\{c_l\}_{l=1}^{b}$, and l' is the index of a randomly selected template point;
Step 3.2.2: Compute $\hat{h}_l$ according to the following formula and build the b-dimensional vector $\hat{h}$ with $\hat{h}_l$ as its elements:

$$\hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (2)$$

where $\hat{h}$ is a b-dimensional vector and $\hat{h}_l$ is its element;
Step 3.2.3: Compute the pre-compensation parameter vector α:
With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$, i.e. find the value of the parameter vector α that minimizes

$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha \qquad (3)$$

where $\hat{J}(\alpha)$ is the approximated variance expectation of the importance weights, α' is the transpose of α, and $\hat{h}'$ is the transpose of $\hat{h}$;
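For readers who want to experiment with step 3.2, the following sketch (our own illustration, not code from the patent; the function names, the use of scipy.optimize.nnls as the non-negative solver, and the small ridge term added to keep $\hat{H}$ numerically positive definite are all our assumptions) builds $\hat{H}$ and $\hat{h}$ from Gaussian basis functions and solves the constrained problem of formula (3):

```python
# Sketch of steps 3.2.1-3.2.3: build H_hat and h_hat from Gaussian basis functions
# and minimize 0.5*a'Ha - h'a subject to a >= 0. Names and solver choice are ours.
import numpy as np
from scipy.optimize import nnls

def gaussian_design(samples, templates, sigma):
    """Phi[i, l] = exp(-||samples_i - templates_l||^2 / (2 sigma^2))."""
    d2 = ((samples[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_alpha(s_tr, s_te, templates, sigma):
    """Solve the non-negatively constrained problem of formulas (1)-(3)."""
    phi_tr = gaussian_design(s_tr, templates, sigma)   # n_tr x b
    phi_te = gaussian_design(s_te, templates, sigma)   # n_te x b
    H = phi_tr.T @ phi_tr / len(s_tr)                  # formula (1)
    h = phi_te.mean(axis=0)                            # formula (2)
    # 0.5*a'Ha - h'a with a >= 0 is equivalent, up to a constant, to the
    # non-negative least-squares problem ||L'a - L^-1 h||^2 where H = L L'.
    L = np.linalg.cholesky(H + 1e-8 * np.eye(H.shape[0]))  # small ridge (our addition)
    alpha, _ = nnls(L.T, np.linalg.solve(L, h))
    return alpha
```

Any quadratic-programming solver that supports the constraint α ≥ 0 could be substituted for the NNLS reformulation used here.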
Step 3.3: Select the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions by cross-validation:
Divide the training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and the test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$ into R subsets $\{S_r^{tr}\}_{r=1}^{R}$ and $\{S_r^{te}\}_{r=1}^{R}$ respectively, and compute the approximated variance expectation of the importance weights for the r-th fold of the cross-validation as

$$\hat{J}_r^{(CV)} = \frac{1}{2|S_r^{tr}|}\sum_{s^{tr}\in S_r^{tr}} \hat{\beta}_r(s^{tr})^2 - \frac{1}{|S_r^{te}|}\sum_{s^{te}\in S_r^{te}} \hat{\beta}_r(s^{te}) \qquad (4)$$

where $\hat{J}_r^{(CV)}$ is the approximated variance expectation of the importance weights for the r-th fold, r = 1, 2, ..., R, $S_r^{tr}$ is the r-th training sample subset, $S_r^{te}$ is the r-th test sample subset, $|S_r^{tr}|$ and $|S_r^{te}|$ are their sample counts, $s^{tr}$ is a sample in $S_r^{tr}$, $s^{te}$ is a sample in $S_r^{te}$, and $\hat{\beta}_r(s^{tr})$ and $\hat{\beta}_r(s^{te})$ are the importance-weight estimates of $s^{tr}$ and $s^{te}$, computed as

$$\hat{\beta}_r(s^{te}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (5)$$

$$\hat{\beta}_r(s^{tr}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{tr}-c_l\|^2}{2\sigma^2}\right) \qquad (6)$$

where $\alpha_l$ is the l-th element of the pre-compensation parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset σ values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to compute the cross-validated approximated variance expectation of the importance weights $\hat{J}^{(CV)}$, and take the σ with the smallest $\hat{J}^{(CV)}$ as the optimal basis-function Gaussian kernel width $\hat{\sigma}$:

$$\hat{J}^{(CV)} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}_r^{(CV)} \qquad (7)$$

where r = 1, 2, ..., R;
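A minimal sketch of the cross-validation of step 3.3, under the assumption (ours, since the text does not spell it out) that α is refit on the samples outside the r-th fold before formula (4) is evaluated on the fold itself; `fit_alpha` and `gaussian_design` are the helper functions sketched above:

```python
# Sketch: score each candidate sigma with formula (4) averaged over R folds
# (formula (7)) and keep the sigma with the smallest score.
import numpy as np

def select_sigma(s_tr, s_te, templates, sigmas=np.arange(0.1, 1.01, 0.1), R=5):
    folds_tr = np.array_split(np.random.permutation(len(s_tr)), R)
    folds_te = np.array_split(np.random.permutation(len(s_te)), R)
    best_sigma, best_score = None, np.inf
    for sigma in sigmas:
        scores = []
        for r in range(R):
            tr_idx, te_idx = folds_tr[r], folds_te[r]
            # alpha fitted on the data outside fold r, evaluated on fold r
            alpha = fit_alpha(np.delete(s_tr, tr_idx, axis=0),
                              np.delete(s_te, te_idx, axis=0), templates, sigma)
            beta_tr = gaussian_design(s_tr[tr_idx], templates, sigma) @ alpha
            beta_te = gaussian_design(s_te[te_idx], templates, sigma) @ alpha
            scores.append(0.5 * np.mean(beta_tr ** 2) - np.mean(beta_te))  # formula (4)
        score = np.mean(scores)                                            # formula (7)
        if score < best_score:
            best_sigma, best_score = sigma, score
    return best_sigma
```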
Step 4: With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$ using the optimal width $\hat{\sigma}$ to obtain the optimal parameter vector $\hat{\alpha}$, where

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right),$$

l, l' = 1, 2, ..., b, $\hat{H}_{l,l'}$ is the element in row l and column l' of the matrix $\hat{H}$, and $\hat{h}_l$ is the l-th element of the vector $\hat{h}$;
Step 5: Compute the importance weight β(s) by the following formula:

$$\beta(s) = \sum_{l=1}^{b}\hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right) \qquad (8)$$

where $\hat{\alpha}_l$ is an element of the optimal parameter vector $\hat{\alpha}$, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points;
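Steps 3 to 5 can then be strung together as follows (again a sketch under the same assumptions as above; the template count b = 100 and the random seed are arbitrary choices of ours):

```python
# Sketch: choose the kernel width by cross-validation, refit alpha with that width,
# and evaluate the importance weight beta(s) of formula (8) for every training sample.
import numpy as np

def estimate_importance_weights(s_tr, s_te, b=100, seed=0):
    rng = np.random.default_rng(seed)
    templates = s_te[rng.choice(len(s_te), size=min(b, len(s_te)), replace=False)]
    sigma_hat = select_sigma(s_tr, s_te, templates)                    # step 3
    alpha_hat = fit_alpha(s_tr, s_te, templates, sigma_hat)            # step 4
    beta_tr = gaussian_design(s_tr, templates, sigma_hat) @ alpha_hat  # step 5, formula (8)
    return beta_tr, sigma_hat
```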
Step 6: Establish the importance-weighted SVM classifier:
Add the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier to obtain the following SVM classifier expression:

$$\min_{w}\ \frac{1}{2}|w|^2 + C\sum_{i=1}^{L}\beta_i\xi_i \qquad (9)$$

This SVM classifier expression, together with the following constraints, constitutes the importance-weighted SVM classifier:

$$y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i, \qquad \xi_i \ge 0, \qquad 1 \le i \le L$$

where w is the normal vector of the separating hyperplane, |w| is its norm, C is the penalty parameter, $d_i$ is the feature vector extracted from the preprocessed training sample $s_i^{tr}$, $y_i \in \{+1, -1\}$ is the class label, together they form the training samples $(d_1, y_1), (d_2, y_2), \ldots, (d_L, y_L)$, $\beta_i$ is the importance weight of the training sample point $(d_i, y_i)$, and $\xi_i$ is the slack variable of the training sample point $(d_i, y_i)$;
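The objective of formula (9) penalizes the slack of training point i by C·β_i, which is exactly what a per-sample weight achieves in an off-the-shelf SVM. Below is a hedged sketch using scikit-learn's `sample_weight` (the linear kernel and the value of C are illustrative choices of ours, not prescribed by the patent):

```python
# Sketch: realize the importance-weighted slack penalty of formula (9) by giving
# each training point an effective penalty C * beta_i through sample_weight.
import numpy as np
from sklearn.svm import SVC

def train_importance_weighted_svm(features, labels, beta, C=1.0):
    """features: utterance-level vectors d_i; labels: emotion classes y_i;
    beta: importance weights of the training points (step 5)."""
    clf = SVC(kernel="linear", C=C)
    # sample_weight scales the penalty of each training point, so the slack of
    # point i is penalized by C * beta_i, matching formula (9) and its constraints.
    clf.fit(features, labels, sample_weight=np.asarray(beta))
    return clf
```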
Step 7: Perform speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier established in step 6.
Further, in the method of the present invention, the preprocessing in step 1 comprises the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal $\bar{X}$:

$$\bar{X}(\bar{n}) = X(\bar{n}) - 0.9375\,X(\bar{n}-1), \qquad 0 \le \bar{n} \le \bar{N}-1$$

where $\bar{n}$ is the discrete-point index of the digital speech signal X, $\bar{N}$ is the length of X, $X(\bar{n})$ and $X(\bar{n}-1)$ are the values of X at the $\bar{n}$-th and $(\bar{n}-1)$-th discrete points, $\bar{X}(\bar{n})$ is the value of the pre-emphasized speech signal at the $\bar{n}$-th discrete point, and X(-1) = 0;
Step 1.2: Frame the pre-emphasized speech signal $\bar{X}$ with the overlapping segmentation method. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate $F_s$ = 16 kHz, and the frame length is 16 ms, i.e. 256 points. Framing $\bar{X}$ yields the speech frame set $\{\bar{x}_{k'}\}_{1\le k'\le K'}$, in which the n-th discrete point of the k'-th speech frame is

$$\bar{x}_{k'}(n) = \bar{X}(n + 128(k'-1)), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $\bar{x}_{k'}$ is the k'-th speech frame in the speech frame set, n is the discrete-point index within a frame, k' is the frame index, and K' is the total number of speech frames, determined from $\bar{N}$, the frame length and the frame shift by rounding down ($\lfloor\cdot\rfloor$ denotes rounding down);
Step 1.3: Apply a Hamming window w of length 256 points to each speech frame $\bar{x}_{k'}$, 1 ≤ k' ≤ K', to obtain the windowed speech frame $x_{k'}$:

$$x_{k'}(n) = \bar{x}_{k'}(n)\,w(n), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $x_{k'}(n)$, $\bar{x}_{k'}(n)$ and w(n) are the values of $x_{k'}$, $\bar{x}_{k'}$ and w at the n-th discrete point, and the 256-point Hamming window function is

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{255}\right), \qquad 0 \le n \le 255$$
Step 1.4: For each windowed speech frame $x_{k'}$, 1 ≤ k' ≤ K', compute the short-time energy $E_{k'}$ and the short-time zero-crossing rate $Z_{k'}$:

$$E_{k'} = \sum_{n=0}^{255} x_{k'}^2(n), \qquad 1 \le k' \le K'$$

$$Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \left|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\right|$$

where $E_{k'}$ is the short-time energy of the windowed speech frame $x_{k'}$, $Z_{k'}$ is its short-time zero-crossing rate, $x_{k'}(n)$ and $x_{k'}(n-1)$ are the values of $x_{k'}$ at the n-th and (n-1)-th sampling points, and $\mathrm{sgn}[x_{k'}(n)]$, $\mathrm{sgn}[x_{k'}(n-1)]$ are the sign functions of $x_{k'}(n)$ and $x_{k'}(n-1)$:

$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \ge 0 \\ -1, & \lambda < 0 \end{cases}$$

where λ is the argument of the sign function;
Step 1.5: Determine the short-time energy threshold $t_E$ and the short-time zero-crossing rate threshold $t_Z$:

$$t_E = \frac{1}{K'}\sum_{k'=1}^{K'} E_{k'}, \qquad t_Z = \frac{0.1}{K'}\sum_{k'=1}^{K'} Z_{k'}$$

where K' is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level decision with the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold $t_E$ as a first-level valid speech frame, take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set, and the first-level valid speech frame with the largest frame number as the end frame of the current valid speech frame set;
Then make a second-level decision with the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in decreasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame; and starting from the end frame and proceeding frame by frame in increasing frame-number order, likewise mark every windowed speech frame whose short-time zero-crossing rate exceeds $t_Z$ as a valid speech frame;
The valid speech frame set obtained after the two-level decision is denoted $\{p_k\}_{1\le k\le K}$, where k is the valid-frame index, K is the total number of valid speech frames, and $p_k$ is the k-th valid speech frame in the set.
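A compact sketch of the preprocessing chain of steps 1.1 to 1.6 (our own code; the way the zero-crossing-rate decision extends the energy-based endpoints frame by frame follows our reading of step 1.6 and stops at the first frame that fails the threshold):

```python
# Sketch: pre-emphasis (0.9375), 256-point frames with 128-point shift at 16 kHz,
# Hamming windowing, and the two-level energy / zero-crossing-rate endpoint decision.
import numpy as np

def preprocess(x, frame_len=256, frame_shift=128):
    x = np.asarray(x, dtype=float)
    emph = np.append(x[:1], x[1:] - 0.9375 * x[:-1])              # step 1.1, X(-1) = 0
    if len(emph) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(emph) - frame_len) // frame_shift
    idx = np.arange(frame_len) + frame_shift * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)                    # steps 1.2-1.3
    energy = (frames ** 2).sum(axis=1)                            # step 1.4
    zcr = 0.5 * np.abs(np.diff(np.where(frames >= 0, 1, -1), axis=1)).sum(axis=1)
    t_e, t_z = energy.mean(), 0.1 * zcr.mean()                    # step 1.5
    voiced = energy > t_e                                         # first-level decision
    if not voiced.any():
        return frames[:0]
    start = int(np.argmax(voiced))
    end = len(voiced) - 1 - int(np.argmax(voiced[::-1]))
    while start > 0 and zcr[start - 1] > t_z:                     # second-level decision
        start -= 1
    while end < len(voiced) - 1 and zcr[end + 1] > t_z:
        end += 1
    return frames[start:end + 1]                                  # valid speech frames
```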
Further, in the method of the present invention, the feature vector $d_i$ in step 5 is extracted as follows:
The short-time features at the speech-frame level, together with their first-order and second-order differences, are used as low-level descriptors, and the statement-level feature $d_i$ is obtained by computing statistics of the low-level descriptors of the utterance.
The statement-level statistical features of an utterance are obtained by computing statistics over all of its short-time features, with the frame-level short-time features (such as fundamental frequency, frame energy, Mel-frequency cepstral coefficients, and the wavelet packet cepstral coefficient features proposed herein) serving as low-level descriptors (LLDs).
The statistics commonly used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 Mel-frequency bands, 13th-order Mel-frequency cepstral coefficients, the positions of the maximum and minimum of the Mel correlation spectrum, and the 90%, 75%, 50%, 25% Mel spectrum roll-off points.
Beneficial effects: Compared with the prior art, the present invention has the following advantages:
Existing speech emotion recognition methods do not account for the covariate shift that exists between training samples and test samples in practical applications, so the performance of speech emotion recognition in practice is worse than under laboratory conditions. The present invention establishes an importance-weight coefficient model that explicitly considers the differences between the test samples and the training samples encountered in practice: the covariate shift between training and test samples is quantified, and the computed importance-weight coefficient β is that quantized value, which directly reflects the deviation between the training and test samples. During subsequent speech emotion feature extraction and classifier construction, the deviation can be compensated through the covariate-shift quantity β, which largely removes the influence of the recording environment on speech emotion recognition. Compared with other deviation-compensation methods for speech emotion recognition, establishing the importance-weight coefficient model to quantify the deviation between training and test samples reduces the computational complexity and difficulty of covariate-shift compensation.
Building on the importance-weight coefficient model, the deviation between training samples and test samples is compensated in the SVM classifier by introducing the importance-weight coefficients. Compared with other SVM-based recognition methods, this method introduces the importance weights into the objective function of the classical SVM classifier, which is equivalent to using a non-fixed penalty factor: samples with larger importance weights receive a larger penalty coefficient, so the separating hyperplane is adjusted accordingly. This reduces the influence of environmental factors on speech emotion recognition, improves the accuracy and stability of speech emotion recognition in practical applications, and gives better classification performance than standard SVMs.
Description of the drawings
Fig. 1 is the training flow chart of the importance-weighted SVM of the present invention.
Fig. 2 is the importance-weight estimation flow chart of the present invention.
Specific embodiment
The present invention is further illustrated below with reference to the embodiments and the accompanying drawings.
The speech emotion recognition method based on an importance-weighted support vector machine classifier of the present invention comprises the following steps:
Step 1: Preprocess the input sample set to obtain the preprocessed training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$, and randomly select b template points $\{c_l\}_{l=1}^{b}$ from the preprocessed test sample set, where $s_i^{tr}$ is a sample in the preprocessed training sample set, $s_j^{te}$ is a sample in the preprocessed test sample set, $n_{tr}$ and $n_{te}$ are the numbers of training and test samples, $c_l$ is a template point randomly selected from $\{s_j^{te}\}$, i is the training-sample index, j is the test-sample index, and l is the index of a randomly selected template point.
The preprocessing specifically includes the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal $\bar{X}$:

$$\bar{X}(\bar{n}) = X(\bar{n}) - 0.9375\,X(\bar{n}-1), \qquad 0 \le \bar{n} \le \bar{N}-1$$

where $\bar{n}$ is the discrete-point index of the digital speech signal X, $\bar{N}$ is the length of X, $X(\bar{n})$ and $X(\bar{n}-1)$ are the values of X at the $\bar{n}$-th and $(\bar{n}-1)$-th discrete points, $\bar{X}(\bar{n})$ is the value of the pre-emphasized speech signal at the $\bar{n}$-th discrete point, and X(-1) = 0;
Step 1.2: Frame the pre-emphasized speech signal $\bar{X}$ with the overlapping segmentation method. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate $F_s$ = 16 kHz, and the frame length is 16 ms, i.e. 256 points. Framing $\bar{X}$ yields the speech frame set $\{\bar{x}_{k'}\}_{1\le k'\le K'}$, in which the n-th discrete point of the k'-th speech frame is

$$\bar{x}_{k'}(n) = \bar{X}(n + 128(k'-1)), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $\bar{x}_{k'}$ is the k'-th speech frame in the speech frame set, n is the discrete-point index within a frame, k' is the frame index, and K' is the total number of speech frames, determined from $\bar{N}$, the frame length and the frame shift by rounding down ($\lfloor\cdot\rfloor$ denotes rounding down);
Step 1.3: Apply a Hamming window w of length 256 points to each speech frame $\bar{x}_{k'}$, 1 ≤ k' ≤ K', to obtain the windowed speech frame $x_{k'}$:

$$x_{k'}(n) = \bar{x}_{k'}(n)\,w(n), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $x_{k'}(n)$, $\bar{x}_{k'}(n)$ and w(n) are the values of $x_{k'}$, $\bar{x}_{k'}$ and w at the n-th discrete point, and the 256-point Hamming window function is

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{255}\right), \qquad 0 \le n \le 255$$
Endpoint detection is then completed using the well-known energy / zero-crossing-rate double-threshold decision method, the specific steps being as follows:
Step 1.4: For each windowed speech frame $x_{k'}$, 1 ≤ k' ≤ K', compute the short-time energy $E_{k'}$ and the short-time zero-crossing rate $Z_{k'}$:

$$E_{k'} = \sum_{n=0}^{255} x_{k'}^2(n), \qquad Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \left|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\right|$$

where $E_{k'}$ is the short-time energy of the windowed speech frame $x_{k'}$, $Z_{k'}$ is its short-time zero-crossing rate, $x_{k'}(n)$ and $x_{k'}(n-1)$ are the values of $x_{k'}$ at the n-th and (n-1)-th sampling points, and $\mathrm{sgn}[\cdot]$ is the sign function, equal to 1 for a non-negative argument and -1 otherwise;
Step 1.5: Determine the short-time energy threshold as the mean short-time energy, $t_E = \frac{1}{K'}\sum_{k'=1}^{K'}E_{k'}$, and the short-time zero-crossing rate threshold as one tenth of the mean short-time zero-crossing rate, $t_Z = \frac{0.1}{K'}\sum_{k'=1}^{K'}Z_{k'}$, where K' is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level decision with the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold $t_E$ as a first-level valid speech frame, take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set and the one with the largest frame number as its end frame; then make a second-level decision with the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in decreasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame, and starting from the end frame and proceeding frame by frame in increasing frame-number order, likewise mark every windowed speech frame whose short-time zero-crossing rate exceeds $t_Z$ as a valid speech frame; the valid speech frame set obtained after the two-level decision is denoted $\{s_k\}_{1\le k\le K}$, where k is the valid-frame index, K is the total number of valid speech frames, and $s_k$ is the k-th valid speech frame in the set.
Step 2: Compute the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions.
The closeness of the distributions of the training sample data and the test sample data can be represented by the importance weight β(s):

$$\beta(s) = \frac{p_{te}(s)}{p_{tr}(s)}$$

where $p_{tr}(s)$ is the distribution density of the preprocessed training sample set $\{s_i^{tr}\}$ and $p_{te}(s)$ is the distribution density of the preprocessed test sample set $\{s_j^{te}\}$.
Step 2.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 2.2: Compute the pre-compensation parameter vector α:
β(s) is modeled by the linear model

$$\hat{\beta}(s) = \sum_{l=1}^{b}\alpha_l\,\varphi_l(s)$$

where α = (α₁, α₂, ..., α_b)', the $\varphi_l(s)$ are basis functions, s ∈ D, l = 1, 2, ..., b, and b and the basis functions can be determined from the samples $\{s_i^{tr}\}$ and $\{s_j^{te}\}$.
The squared-error criterion $J_0(\alpha)$ of the model is

$$J_0(\alpha) = \frac{1}{2}\int \left(\hat{\beta}(s) - \beta(s)\right)^2 p_{tr}(s)\,ds = \frac{1}{2}\int \hat{\beta}(s)^2 p_{tr}(s)\,ds - \int \hat{\beta}(s)\,p_{te}(s)\,ds + \frac{1}{2}\int \beta(s)\,p_{te}(s)\,ds$$

The last term of the above formula is a constant and can be ignored; the first two terms are denoted J(α):

$$J(\alpha) = \frac{1}{2}\alpha' H \alpha - h'\alpha$$

where α' is the transpose of α, H is a b × b matrix with elements $H_{l,l'} = \int \varphi_l(s)\varphi_{l'}(s)\,p_{tr}(s)\,ds$, and h is a b-dimensional vector with elements $h_l = \int \varphi_l(s)\,p_{te}(s)\,ds$.
The expectations in J(α) are approximated by sample averages, giving the approximated variance expectation of the importance weights $\hat{J}(\alpha)$:

$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha$$

where $\hat{H}$ is a b × b matrix with elements $\hat{H}_{l,l'} = \frac{1}{n_{tr}}\sum_{i=1}^{n_{tr}}\varphi_l(s_i^{tr})\varphi_{l'}(s_i^{tr})$, $\hat{h}$ is a b-dimensional vector with elements $\hat{h}_l = \frac{1}{n_{te}}\sum_{j=1}^{n_{te}}\varphi_l(s_j^{te})$, and $\hat{h}'$ is the transpose of $\hat{h}$.
Taking the non-negativity of the importance weight β(s) into account, this is converted into the optimization problem

$$\min_{\alpha}\ \hat{J}(\alpha) \qquad \text{subject to} \qquad \alpha \ge 0$$

Solving this optimization problem, the parameter vector α is its optimal solution.
When computing $\hat{H}$ and $\hat{h}$, $\varphi_l(s)$ is taken as a Gaussian kernel function with kernel width σ:

$$\varphi_l(s) = \exp\left(-\frac{\|s-c_l\|^2}{2\sigma^2}\right)$$

Substituting $\varphi_l(s)$ into $\hat{H}$ and $\hat{h}$ gives

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right)$$

where $\hat{H}_{l,l'}$ is an element of $\hat{H}$, $\hat{h}_l$ is an element of $\hat{h}$, l, l' = 1, 2, ..., b, $c_{l'}$ is a template point randomly selected from $\{s_j^{te}\}$, l' is the index of a randomly selected template point, and σ is one of the preset values.
Step 2.3: Select the optimal basis-function Gaussian kernel width $\hat{\sigma}$ by cross-validation:
Divide the preprocessed training sample set $\{s_i^{tr}\}$ and test sample set $\{s_j^{te}\}$ into R subsets $\{S_r^{tr}\}_{r=1}^{R}$ and $\{S_r^{te}\}_{r=1}^{R}$ respectively, and compute

$$\hat{J}_r^{(CV)} = \frac{1}{2|S_r^{tr}|}\sum_{s^{tr}\in S_r^{tr}} \hat{\beta}_r(s^{tr})^2 - \frac{1}{|S_r^{te}|}\sum_{s^{te}\in S_r^{te}} \hat{\beta}_r(s^{te})$$

where $\hat{J}_r^{(CV)}$ is the approximated variance expectation of the importance weights for the r-th fold of the cross-validation, r = 1, 2, ..., R, $S_r^{tr}$ is the r-th training sample subset, $S_r^{te}$ is the r-th test sample subset, $|S_r^{tr}|$ and $|S_r^{te}|$ are their sample counts, $s^{tr}$ is a sample in $S_r^{tr}$, $s^{te}$ is a sample in $S_r^{te}$, and $\hat{\beta}_r(s^{tr})$ and $\hat{\beta}_r(s^{te})$ are the importance-weight estimates of $s^{tr}$ and $s^{te}$.
The cross-validated approximated variance expectation of the importance weights is

$$\hat{J}^{(CV)} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}_r^{(CV)}, \qquad r = 1, 2, \ldots, R.$$

The optimal basis-function Gaussian kernel width $\hat{\sigma}$ is the value that minimizes $\hat{J}^{(CV)}$ over the preset values σ = 0.1, 0.2, ..., 1.
Step 3: Compute the optimal parameter vector $\hat{\alpha}$:
Using the Gaussian basis functions obtained in step 2 and the optimal basis-function Gaussian kernel width $\hat{\sigma}$, recompute $\hat{H}$ and $\hat{h}$ as

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right)$$

where l, l' = 1, 2, ..., b;
With these $\hat{H}$ and $\hat{h}$, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$ under the constraint α ≥ 0 to obtain the optimal parameter vector $\hat{\alpha}$.
Step 4: Compute the approximate importance weights:
From step 2, β(s) is modeled by the linear model $\hat{\beta}(s)=\sum_{l=1}^{b}\hat{\alpha}_l\varphi_l(s)$; substituting the Gaussian basis functions gives

$$\beta(s) = \sum_{l=1}^{b}\hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right)$$

where $\hat{\alpha}_l$ is an element of the vector $\hat{\alpha}$, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points.
Step 5: Establish the importance-weighted SVM classifier model:
The importance weights are added as coefficients on the slack variables ξ of the standard SVM classifier:

$$\min_{w}\ \frac{1}{2}|w|^2 + C\sum_{i=1}^{L}\beta_i\xi_i \qquad (13)$$

with the constraints $y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i$, $\xi_i \ge 0$, $1 \le i \le L$, where w is the normal vector of the separating hyperplane, |w| is its norm, ξ is the slack variable, C is the penalty parameter, $d_i$ is the feature vector extracted from the training sample $s_i^{tr}$, $y_i \in \{+1, -1\}$ is the class label, together they form the training samples $(d_1, y_1), (d_2, y_2), \ldots, (d_L, y_L)$, and $\beta_i$ is the importance weight of the training sample point $(d_i, y_i)$.
The statement-level statistical features of an utterance are obtained by computing statistics over all of its short-time features, with the frame-level short-time features (such as fundamental frequency, frame energy, Mel-frequency cepstral coefficients, and the wavelet packet cepstral coefficient features proposed herein) serving as low-level descriptors (LLDs).
The statistics commonly used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 Mel-frequency bands, 13th-order Mel-frequency cepstral coefficients, the positions of the maximum and minimum of the Mel correlation spectrum, and the 90%, 75%, 50%, 25% Mel spectrum roll-off points. Formula (13) and its constraints constitute the importance-weighted SVM classifier model.
The above embodiment is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and equivalent substitutions without departing from the principles of the present invention, and the technical solutions obtained by applying such improvements and equivalent substitutions to the claims of the present invention all fall within the protection scope of the present invention.

Claims (3)

1. A speech emotion recognition method based on an importance-weighted support vector machine classifier, characterized in that the method comprises the following steps:
Step 1: Preprocess the input speech signal and extract the feature vector $d_i$;
Step 2: Divide the input sample set into a training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and a test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$, and randomly select b template points $c_l$ from the test sample set to form $\{c_l\}_{l=1}^{b}$, where $s_i^{tr}$ is a sample in the training sample set, $s_j^{te}$ is a sample in the test sample set, $n_{tr}$ is the number of training samples, $n_{te}$ is the number of test samples, i is the training-sample index, j is the test-sample index, and l is the index of a randomly selected template point;
Step 3: Compute the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions, the detailed procedure being as follows:
Step 3.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 3.2: Compute the pre-compensation parameter vector α according to the following procedure:
Step 3.2.1: Compute $\hat{H}_{l,l'}$ according to the following formula and build the b × b matrix $\hat{H}$ with $\hat{H}_{l,l'}$ as its elements:
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right) \qquad (1)$$
where $\hat{H}$ is a b × b matrix, $\hat{H}_{l,l'}$ is its element, l, l' = 1, 2, ..., b, $c_{l'}$ is a point of the randomly selected template set $\{c_l\}_{l=1}^{b}$, and l' is the index of a randomly selected template point;
Step 3.2.2: Compute $\hat{h}_l$ according to the following formula and build the b-dimensional vector $\hat{h}$ with $\hat{h}_l$ as its elements:
$$\hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (2)$$
where $\hat{h}$ is a b-dimensional vector and $\hat{h}_l$ is its element;
Step 3.2.3: Compute the pre-compensation parameter vector α:
With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$, i.e. find the value of the parameter vector α that minimizes the following formula:
$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha \qquad (3)$$
where $\hat{J}(\alpha)$ is the approximated variance expectation of the importance weights, α' is the transpose of α, and $\hat{h}'$ is the transpose of $\hat{h}$;
Step 3.3: Select the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions by cross-validation:
Divide the training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and the test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$ into R subsets $\{S_r^{tr}\}_{r=1}^{R}$ and $\{S_r^{te}\}_{r=1}^{R}$ respectively, and compute the approximated variance expectation of the importance weights for the r-th fold of the cross-validation according to the following formula:
$$\hat{J}_r^{(CV)} = \frac{1}{2|S_r^{tr}|}\sum_{s^{tr}\in S_r^{tr}} \hat{\beta}_r(s^{tr})^2 - \frac{1}{|S_r^{te}|}\sum_{s^{te}\in S_r^{te}} \hat{\beta}_r(s^{te}) \qquad (4)$$
where $\hat{J}_r^{(CV)}$ is the approximated variance expectation of the importance weights for the r-th fold, r = 1, 2, ..., R, $S_r^{tr}$ is the r-th training sample subset, $S_r^{te}$ is the r-th test sample subset, $|S_r^{tr}|$ is the number of samples in $S_r^{tr}$, $|S_r^{te}|$ is the number of samples in $S_r^{te}$, $s^{tr}$ is a sample in $S_r^{tr}$, $s^{te}$ is a sample in $S_r^{te}$, and $\hat{\beta}_r(s^{tr})$ and $\hat{\beta}_r(s^{te})$ are the importance-weight estimates of $s^{tr}$ and $s^{te}$, computed by the following formulas:
$$\hat{\beta}_r(s^{te}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (5)$$

$$\hat{\beta}_r(s^{tr}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{tr}-c_l\|^2}{2\sigma^2}\right) \qquad (6)$$
where $\alpha_l$ is the l-th element of the pre-compensation parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset σ values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to compute the cross-validated approximated variance expectation of the importance weights $\hat{J}^{(CV)}$, and take the σ with the smallest $\hat{J}^{(CV)}$ as the optimal basis-function Gaussian kernel width $\hat{\sigma}$:
$$\hat{J}^{(CV)} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}_r^{(CV)} \qquad (7)$$
where r = 1, 2, ..., R;
Step 4: With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$ to obtain the optimal parameter vector $\hat{\alpha}$, where

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right),$$

l, l' = 1, 2, ..., b, $\hat{H}_{l,l'}$ is the element in row l and column l' of the matrix $\hat{H}$, and $\hat{h}_l$ is the l-th element of the vector $\hat{h}$;
Step 5: Compute the importance weight β(s) by the following formula:
$$\beta(s) = \sum_{l=1}^{b}\hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right) \qquad (8)$$
where $\hat{\alpha}_l$ is an element of the optimal parameter vector $\hat{\alpha}$, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points;
Step 6: Establish the importance-weighted SVM classifier:
Add the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier to obtain the following SVM classifier expression:
$$\min_{w}\ \frac{1}{2}|w|^2 + C\sum_{i=1}^{L}\beta_i\xi_i \qquad (9)$$
This SVM classifier expression, together with the following constraints, constitutes the importance-weighted SVM classifier:

$$y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i, \qquad \xi_i \ge 0, \qquad 1 \le i \le L$$

where w is the normal vector of the separating hyperplane, |w| is its norm, C is the penalty parameter, $d_i$ is the feature vector extracted from the preprocessed training sample $s_i^{tr}$, $y_i \in \{+1, -1\}$ is the class label, together they form the training samples $(d_1, y_1), (d_2, y_2), \ldots, (d_L, y_L)$, $\beta_i$ is the importance weight of the training sample point $(d_i, y_i)$, and $\xi_i$ is the slack variable of the training sample point $(d_i, y_i)$;
Step 7: Perform speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier established in step 6.
2. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1, characterized in that the preprocessing in step 1 comprises the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal $\bar{X}$:
$$\bar{X}(\bar{n}) = X(\bar{n}) - 0.9375\,X(\bar{n}-1), \qquad 0 \le \bar{n} \le \bar{N}-1$$
where $\bar{n}$ is the discrete-point index of the digital speech signal X, $\bar{N}$ is the length of X, $X(\bar{n})$ and $X(\bar{n}-1)$ are the values of X at the $\bar{n}$-th and $(\bar{n}-1)$-th discrete points, $\bar{X}(\bar{n})$ is the value of the pre-emphasized speech signal at the $\bar{n}$-th discrete point, and X(-1) = 0;
Step 1.2: Frame the pre-emphasized speech signal $\bar{X}$ with the overlapping segmentation method, the distance between the starting points of two consecutive frames being called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate $F_s$ = 16 kHz, and the frame length is 16 ms, i.e. 256 points; framing $\bar{X}$ yields the speech frame set $\{\bar{x}_{k'}\}_{1\le k'\le K'}$, in which the n-th discrete point of the k'-th speech frame is:
$$\bar{x}_{k'}(n) = \bar{X}(n + 128(k'-1)), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$
where $\bar{x}_{k'}$ is the k'-th speech frame in the speech frame set, n is the discrete-point index within a frame, k' is the frame index, and K' is the total number of speech frames, determined from $\bar{N}$, the frame length and the frame shift by rounding down ($\lfloor\cdot\rfloor$ denotes rounding down);
Step 1.3: Apply a Hamming window w of length 256 points to each speech frame $\bar{x}_{k'}$, 1 ≤ k' ≤ K', to obtain the windowed speech frame $x_{k'}$:
$$x_{k'}(n) = \bar{x}_{k'}(n)\,w(n), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$
where $x_{k'}(n)$, $\bar{x}_{k'}(n)$ and w(n) are the values of $x_{k'}$, $\bar{x}_{k'}$ and w at the n-th discrete point, and the 256-point Hamming window function is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{255}\right), \qquad 0 \le n \le 255$$
Step 1.4: For each windowed speech frame $x_{k'}$, 1 ≤ k' ≤ K', compute the short-time energy $E_{k'}$ and the short-time zero-crossing rate $Z_{k'}$:
$$E_{k'} = \sum_{n=0}^{255} x_{k'}^2(n), \qquad 1 \le k' \le K'$$

$$Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \left|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\right|$$
where $E_{k'}$ is the short-time energy of the windowed speech frame $x_{k'}$, $Z_{k'}$ is its short-time zero-crossing rate, $x_{k'}(n)$ and $x_{k'}(n-1)$ are the values of $x_{k'}$ at the n-th and (n-1)-th sampling points, and $\mathrm{sgn}[x_{k'}(n)]$, $\mathrm{sgn}[x_{k'}(n-1)]$ are the sign functions of $x_{k'}(n)$ and $x_{k'}(n-1)$, i.e.:
$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \ge 0 \\ -1, & \lambda < 0 \end{cases}$$
where λ is the argument of the above sign function;
Step 1.5: Determine the short-time energy threshold $t_E$ and the short-time zero-crossing rate threshold $t_Z$:
$$t_E = \frac{1}{K'}\sum_{k'=1}^{K'} E_{k'}$$

$$t_Z = \frac{0.1}{K'}\sum_{k'=1}^{K'} Z_{k'}$$
where K' is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level decision with the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold $t_E$ as a first-level valid speech frame, take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set, and the first-level valid speech frame with the largest frame number as the end frame of the current valid speech frame set;
Then make a second-level decision with the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in decreasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame; and starting from the end frame and proceeding frame by frame in increasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame;
The valid speech frame set obtained after the two-level decision is denoted $\{p_k\}_{1\le k\le K}$, where k is the valid-frame index, K is the total number of valid speech frames, and $p_k$ is the k-th valid speech frame in the set.
3. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1 or 2, characterized in that the feature vector $d_i$ in step 5 is extracted as follows:
The short-time features at the speech-frame level, together with their first-order and second-order differences, are used as low-level descriptors, and the statement-level feature $d_i$ is obtained by computing statistics of the low-level descriptors of the utterance.
CN201610969948.7A 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier Active CN106504772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610969948.7A CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610969948.7A CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Publications (2)

Publication Number Publication Date
CN106504772A true CN106504772A (en) 2017-03-15
CN106504772B CN106504772B (en) 2019-08-20

Family

ID=58322831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610969948.7A Active CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Country Status (1)

Country Link
CN (1) CN106504772B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080077720A (en) * 2007-02-21 2008-08-26 인하대학교 산학협력단 A voice activity detecting method based on a support vector machine(svm) using a posteriori snr, a priori snr and a predicted snr as a feature vector
KR20110021328A (en) * 2009-08-26 2011-03-04 인하대학교 산학협력단 The method to improve the performance of speech/music classification for 3gpp2 codec by employing svm based on discriminative weight training
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN103544963A (en) * 2013-11-07 2014-01-29 东南大学 Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN104091602A (en) * 2014-07-11 2014-10-08 电子科技大学 Speech emotion recognition method based on fuzzy support vector machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YONGMING HUANG ET AL.: "Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition", 《IET SIGNAL PROCESS》 *
QIN YUQIANG, ZHANG XUEYING: "Speech signal emotion recognition based on SVM", 《JOURNAL OF CIRCUITS AND SYSTEMS》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108735233A (en) * 2017-04-24 2018-11-02 北京理工大学 A kind of personality recognition methods and device
CN108364641A (en) * 2018-01-09 2018-08-03 东南大学 A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
WO2020024210A1 (en) * 2018-08-02 2020-02-06 深圳大学 Method and apparatus for optimizing window parameter of integrated kernel density estimator, and terminal device
CN110991238A (en) * 2019-10-30 2020-04-10 中国科学院自动化研究所南京人工智能芯片创新研究院 Speech auxiliary system based on speech emotion analysis and micro-expression recognition
CN111415680A (en) * 2020-03-26 2020-07-14 心图熵动科技(苏州)有限责任公司 Method for generating anxiety prediction model based on voice and anxiety prediction system
CN113434698A (en) * 2021-06-30 2021-09-24 华中科技大学 Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN113434698B (en) * 2021-06-30 2022-08-02 华中科技大学 Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN116801456A (en) * 2023-08-22 2023-09-22 深圳市创洺盛光电科技有限公司 Intelligent control method of LED lamp

Also Published As

Publication number Publication date
CN106504772B (en) 2019-08-20

Similar Documents

Publication Publication Date Title
CN106504772B (en) Speech-emotion recognition method based on weights of importance support vector machine classifier
CN108305616B (en) Audio scene recognition method and device based on long-time and short-time feature extraction
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN101599271B (en) Recognition method of digital music emotion
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
Nwe et al. Speech based emotion classification
CN107146601A (en) A kind of rear end i vector Enhancement Methods for Speaker Recognition System
CN106328121A (en) Chinese traditional musical instrument classification method based on depth confidence network
CN107393554A (en) In a kind of sound scene classification merge class between standard deviation feature extracting method
CN102800316A (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN107146615A (en) Audio recognition method and system based on the secondary identification of Matching Model
CN103236258B (en) Based on the speech emotional characteristic extraction method that Pasteur&#39;s distance wavelet packets decomposes
CN111210803B (en) System and method for training clone timbre and rhythm based on Bottle sock characteristics
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN109410911A (en) Artificial intelligence learning method based on speech recognition
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN108461085A (en) A kind of method for distinguishing speek person under the conditions of Short Time Speech
CN107274887A (en) Speaker&#39;s Further Feature Extraction method based on fusion feature MGFCC
CN105070300A (en) Voice emotion characteristic selection method based on speaker standardization change
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN108364641A (en) A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
CN110047504A (en) Method for distinguishing speek person under identity vector x-vector linear transformation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant