CN106504772A - Speech-emotion recognition method based on weights of importance support vector machine classifier - Google Patents

Speech-emotion recognition method based on weights of importance support vector machine classifier

Info

Publication number
CN106504772A
CN106504772A
Authority
CN
China
Prior art keywords
frame
sample
weights
importance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610969948.7A
Other languages
Chinese (zh)
Other versions
CN106504772B (en)
Inventor
黄永明
吴奥
章国宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610969948.7A priority Critical patent/CN106504772B/en
Publication of CN106504772A publication Critical patent/CN106504772A/en
Application granted granted Critical
Publication of CN106504772B publication Critical patent/CN106504772B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Abstract

The invention discloses a speech emotion recognition method based on an importance-weighted support vector machine (SVM) classifier, comprising the quantification of the deviation between training samples and test samples, the establishment of an importance-weight coefficient model, and the construction of an SVM based on the importance-weight coefficients. The method quantifies the deviation between training samples and test samples on the basis of the importance-weight coefficients, so that the deviation can be compensated at the classifier level. By building an importance-weight coefficient model for the deviation between training samples and test samples in emotion classification, quantifying that deviation in the speech samples, and adjusting the separating hyperplane of an SVM classifier based on the importance-weight model, the invention compensates the deviation at the classifier level and improves the accuracy and stability of speech emotion recognition.

Description

Speech-emotion recognition method based on weights of importance support vector machine classifier
Technical field
The present invention relates to a speech emotion recognition method based on an importance-weighted support vector machine classifier, and belongs to the technical field of speech emotion recognition.
Background technology
With the rapid development of information technology and the rise of various intelligent terminals, existing human-computer interaction systems face increasingly severe tests. To overcome the obstacles of human-computer interaction and make it more convenient and natural, the emotional intelligence of machines has received growing attention from researchers in many fields. Speech, as a highly efficient interactive medium in modern human-computer interaction, carries rich emotional information. Speech emotion recognition, as an important research topic of emotional intelligence, has broad application prospects in distance education, assisted lie detection, automated remote telephone service centers, clinical medicine, intelligent toys, and smartphones, and has attracted extensive attention from more and more research institutions and researchers.
In speech emotion recognition, the acquisition time and recording environment of the training samples differ from those of the test samples, so a covariate shift exists between the training samples and the test samples. To improve the precision and robustness of speech emotion recognition, compensating for this deviation is essential. Excluding the deviation introduced by the recording environment, removing redundant information that is unrelated to emotion (such as speaker and spoken-content information) from the raw speech data, and extracting effective emotional information are the key points and difficulties in improving the robustness of speech emotion recognition systems.
As an emerging speech technology, the importance-weight coefficient model has drawn increasing attention from researchers because of its flexibility and effectiveness in speech signal processing. For classification problems, the deviation between training samples and test samples can be quantified on the basis of the importance-weight coefficients, so that the deviation can be compensated at the classifier level; this reduces the influence of environmental factors on speech emotion recognition and improves its accuracy and stability. Compensating, at the classifier level, for the covariate shift that exists between training samples and test samples is therefore of great significance for speech emotion recognition research.
Content of the invention
Technical problem: The present invention provides a speech emotion recognition method based on an importance-weighted support vector machine classifier that improves the robustness of speech emotion recognition by compensating, at the classifier level, for the covariate shift between training samples and test samples. The method reduces the influence of irrelevant information such as the recording environment and the speaker on recognition, and improves the precision and robustness of speech emotion recognition.
Technical scheme: The speech emotion recognition method based on an importance-weighted support vector machine classifier of the present invention comprises the following steps:
Step 1: Preprocess the input speech signal and extract the feature vector $d_i$;
Step 2: Divide the input sample set into a training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and a test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$, and randomly select b template points $c_l$ from the test sample set to form $\{c_l\}_{l=1}^{b}$, where $s_i^{tr}$ is a sample in the training sample set, $s_j^{te}$ is a sample in the test sample set, $n_{tr}$ is the number of training samples, $n_{te}$ is the number of test samples, i is the training-sample index, j is the test-sample index, and l is the index of a randomly selected template point;
Step 3: Compute the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions. The detailed procedure is as follows:
Step 3.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 3.2: Compute the pre-compensation parameter vector α according to the following procedure:
Step 3.2.1: Compute $\hat{H}_{l,l'}$ according to the following formula and build the b × b matrix $\hat{H}$ with $\hat{H}_{l,l'}$ as its elements:

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right) \qquad (1)$$

where $\hat{H}$ is a b × b matrix, $\hat{H}_{l,l'}$ is its element, l, l' = 1, 2, ..., b, $c_{l'}$ is a point of the randomly selected template set $\{c_l\}_{l=1}^{b}$, and l' is the index of a randomly selected template point;
Step 3.2.2: Compute $\hat{h}_l$ according to the following formula and build the b-dimensional vector $\hat{h}$ with $\hat{h}_l$ as its elements:

$$\hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (2)$$

where $\hat{h}$ is a b-dimensional vector and $\hat{h}_l$ is its element;
Step 3.2.3: Compute the pre-compensation parameter vector α:
With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$, i.e. find the value of the parameter vector α that minimizes

$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha \qquad (3)$$

where $\hat{J}(\alpha)$ is the approximated variance expectation of the importance weights, α' is the transpose of α, and $\hat{h}'$ is the transpose of $\hat{h}$;
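For readers who want to experiment with step 3.2, the following sketch (our own illustration, not code from the patent; the function names, the use of scipy.optimize.nnls as the non-negative solver, and the small ridge term added to keep $\hat{H}$ numerically positive definite are all our assumptions) builds $\hat{H}$ and $\hat{h}$ from Gaussian basis functions and solves the constrained problem of formula (3):

```python
# Sketch of steps 3.2.1-3.2.3: build H_hat and h_hat from Gaussian basis functions
# and minimize 0.5*a'Ha - h'a subject to a >= 0. Names and solver choice are ours.
import numpy as np
from scipy.optimize import nnls

def gaussian_design(samples, templates, sigma):
    """Phi[i, l] = exp(-||samples_i - templates_l||^2 / (2 sigma^2))."""
    d2 = ((samples[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_alpha(s_tr, s_te, templates, sigma):
    """Solve the non-negatively constrained problem of formulas (1)-(3)."""
    phi_tr = gaussian_design(s_tr, templates, sigma)   # n_tr x b
    phi_te = gaussian_design(s_te, templates, sigma)   # n_te x b
    H = phi_tr.T @ phi_tr / len(s_tr)                  # formula (1)
    h = phi_te.mean(axis=0)                            # formula (2)
    # 0.5*a'Ha - h'a with a >= 0 is equivalent, up to a constant, to the
    # non-negative least-squares problem ||L'a - L^-1 h||^2 where H = L L'.
    L = np.linalg.cholesky(H + 1e-8 * np.eye(H.shape[0]))  # small ridge (our addition)
    alpha, _ = nnls(L.T, np.linalg.solve(L, h))
    return alpha
```

Any quadratic-programming solver that supports the constraint α ≥ 0 could be substituted for the NNLS reformulation used here.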
Step 3.3: Select the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions by cross-validation:
Divide the training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and the test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$ into R subsets $\{S_r^{tr}\}_{r=1}^{R}$ and $\{S_r^{te}\}_{r=1}^{R}$ respectively, and compute the approximated variance expectation of the importance weights for the r-th fold of the cross-validation as

$$\hat{J}_r^{(CV)} = \frac{1}{2|S_r^{tr}|}\sum_{s^{tr}\in S_r^{tr}} \hat{\beta}_r(s^{tr})^2 - \frac{1}{|S_r^{te}|}\sum_{s^{te}\in S_r^{te}} \hat{\beta}_r(s^{te}) \qquad (4)$$

where $\hat{J}_r^{(CV)}$ is the approximated variance expectation of the importance weights for the r-th fold, r = 1, 2, ..., R, $S_r^{tr}$ is the r-th training sample subset, $S_r^{te}$ is the r-th test sample subset, $|S_r^{tr}|$ and $|S_r^{te}|$ are their sample counts, $s^{tr}$ is a sample in $S_r^{tr}$, $s^{te}$ is a sample in $S_r^{te}$, and $\hat{\beta}_r(s^{tr})$ and $\hat{\beta}_r(s^{te})$ are the importance-weight estimates of $s^{tr}$ and $s^{te}$, computed as

$$\hat{\beta}_r(s^{te}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (5)$$

$$\hat{\beta}_r(s^{tr}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{tr}-c_l\|^2}{2\sigma^2}\right) \qquad (6)$$

where $\alpha_l$ is the l-th element of the pre-compensation parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset σ values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to compute the cross-validated approximated variance expectation of the importance weights $\hat{J}^{(CV)}$, and take the σ with the smallest $\hat{J}^{(CV)}$ as the optimal basis-function Gaussian kernel width $\hat{\sigma}$:

$$\hat{J}^{(CV)} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}_r^{(CV)} \qquad (7)$$

where r = 1, 2, ..., R;
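A minimal sketch of the cross-validation of step 3.3, under the assumption (ours, since the text does not spell it out) that α is refit on the samples outside the r-th fold before formula (4) is evaluated on the fold itself; `fit_alpha` and `gaussian_design` are the helper functions sketched above:

```python
# Sketch: score each candidate sigma with formula (4) averaged over R folds
# (formula (7)) and keep the sigma with the smallest score.
import numpy as np

def select_sigma(s_tr, s_te, templates, sigmas=np.arange(0.1, 1.01, 0.1), R=5):
    folds_tr = np.array_split(np.random.permutation(len(s_tr)), R)
    folds_te = np.array_split(np.random.permutation(len(s_te)), R)
    best_sigma, best_score = None, np.inf
    for sigma in sigmas:
        scores = []
        for r in range(R):
            tr_idx, te_idx = folds_tr[r], folds_te[r]
            # alpha fitted on the data outside fold r, evaluated on fold r
            alpha = fit_alpha(np.delete(s_tr, tr_idx, axis=0),
                              np.delete(s_te, te_idx, axis=0), templates, sigma)
            beta_tr = gaussian_design(s_tr[tr_idx], templates, sigma) @ alpha
            beta_te = gaussian_design(s_te[te_idx], templates, sigma) @ alpha
            scores.append(0.5 * np.mean(beta_tr ** 2) - np.mean(beta_te))  # formula (4)
        score = np.mean(scores)                                            # formula (7)
        if score < best_score:
            best_sigma, best_score = sigma, score
    return best_sigma
```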
Step 4: With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$ using the optimal width $\hat{\sigma}$ to obtain the optimal parameter vector $\hat{\alpha}$, where

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right),$$

l, l' = 1, 2, ..., b, $\hat{H}_{l,l'}$ is the element in row l and column l' of the matrix $\hat{H}$, and $\hat{h}_l$ is the l-th element of the vector $\hat{h}$;
Step 5: Compute the importance weight β(s) by the following formula:

$$\beta(s) = \sum_{l=1}^{b}\hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right) \qquad (8)$$

where $\hat{\alpha}_l$ is an element of the optimal parameter vector $\hat{\alpha}$, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points;
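Steps 3 to 5 can then be strung together as follows (again a sketch under the same assumptions as above; the template count b = 100 and the random seed are arbitrary choices of ours):

```python
# Sketch: choose the kernel width by cross-validation, refit alpha with that width,
# and evaluate the importance weight beta(s) of formula (8) for every training sample.
import numpy as np

def estimate_importance_weights(s_tr, s_te, b=100, seed=0):
    rng = np.random.default_rng(seed)
    templates = s_te[rng.choice(len(s_te), size=min(b, len(s_te)), replace=False)]
    sigma_hat = select_sigma(s_tr, s_te, templates)                    # step 3
    alpha_hat = fit_alpha(s_tr, s_te, templates, sigma_hat)            # step 4
    beta_tr = gaussian_design(s_tr, templates, sigma_hat) @ alpha_hat  # step 5, formula (8)
    return beta_tr, sigma_hat
```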
Step 6: Establish the importance-weighted SVM classifier:
Add the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier to obtain the following SVM classifier expression:

$$\min_{w}\ \frac{1}{2}|w|^2 + C\sum_{i=1}^{L}\beta_i\xi_i \qquad (9)$$

This SVM classifier expression, together with the following constraints, constitutes the importance-weighted SVM classifier:

$$y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i, \qquad \xi_i \ge 0, \qquad 1 \le i \le L$$

where w is the normal vector of the separating hyperplane, |w| is its norm, C is the penalty parameter, $d_i$ is the feature vector extracted from the preprocessed training sample $s_i^{tr}$, $y_i \in \{+1, -1\}$ is the class label, together they form the training samples $(d_1, y_1), (d_2, y_2), \ldots, (d_L, y_L)$, $\beta_i$ is the importance weight of the training sample point $(d_i, y_i)$, and $\xi_i$ is the slack variable of the training sample point $(d_i, y_i)$;
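The objective of formula (9) penalizes the slack of training point i by C·β_i, which is exactly what a per-sample weight achieves in an off-the-shelf SVM. Below is a hedged sketch using scikit-learn's `sample_weight` (the linear kernel and the value of C are illustrative choices of ours, not prescribed by the patent):

```python
# Sketch: realize the importance-weighted slack penalty of formula (9) by giving
# each training point an effective penalty C * beta_i through sample_weight.
import numpy as np
from sklearn.svm import SVC

def train_importance_weighted_svm(features, labels, beta, C=1.0):
    """features: utterance-level vectors d_i; labels: emotion classes y_i;
    beta: importance weights of the training points (step 5)."""
    clf = SVC(kernel="linear", C=C)
    # sample_weight scales the penalty of each training point, so the slack of
    # point i is penalized by C * beta_i, matching formula (9) and its constraints.
    clf.fit(features, labels, sample_weight=np.asarray(beta))
    return clf
```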
Step 7: Perform speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier established in step 6.
Further, in the method of the present invention, the preprocessing in step 1 comprises the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal $\bar{X}$:

$$\bar{X}(\bar{n}) = X(\bar{n}) - 0.9375\,X(\bar{n}-1), \qquad 0 \le \bar{n} \le \bar{N}-1$$

where $\bar{n}$ is the discrete-point index of the digital speech signal X, $\bar{N}$ is the length of X, $X(\bar{n})$ and $X(\bar{n}-1)$ are the values of X at the $\bar{n}$-th and $(\bar{n}-1)$-th discrete points, $\bar{X}(\bar{n})$ is the value of the pre-emphasized speech signal at the $\bar{n}$-th discrete point, and X(-1) = 0;
Step 1.2: Frame the pre-emphasized speech signal $\bar{X}$ with the overlapping segmentation method. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate $F_s$ = 16 kHz, and the frame length is 16 ms, i.e. 256 points. Framing $\bar{X}$ yields the speech frame set $\{\bar{x}_{k'}\}_{1\le k'\le K'}$, in which the n-th discrete point of the k'-th speech frame is

$$\bar{x}_{k'}(n) = \bar{X}(n + 128(k'-1)), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $\bar{x}_{k'}$ is the k'-th speech frame in the speech frame set, n is the discrete-point index within a frame, k' is the frame index, and K' is the total number of speech frames, determined from $\bar{N}$, the frame length and the frame shift by rounding down ($\lfloor\cdot\rfloor$ denotes rounding down);
Step 1.3: Apply a Hamming window w of length 256 points to each speech frame $\bar{x}_{k'}$, 1 ≤ k' ≤ K', to obtain the windowed speech frame $x_{k'}$:

$$x_{k'}(n) = \bar{x}_{k'}(n)\,w(n), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $x_{k'}(n)$, $\bar{x}_{k'}(n)$ and w(n) are the values of $x_{k'}$, $\bar{x}_{k'}$ and w at the n-th discrete point, and the 256-point Hamming window function is

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{255}\right), \qquad 0 \le n \le 255$$
Step 1.4: For each windowed speech frame $x_{k'}$, 1 ≤ k' ≤ K', compute the short-time energy $E_{k'}$ and the short-time zero-crossing rate $Z_{k'}$:

$$E_{k'} = \sum_{n=0}^{255} x_{k'}^2(n), \qquad 1 \le k' \le K'$$

$$Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \left|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\right|$$

where $E_{k'}$ is the short-time energy of the windowed speech frame $x_{k'}$, $Z_{k'}$ is its short-time zero-crossing rate, $x_{k'}(n)$ and $x_{k'}(n-1)$ are the values of $x_{k'}$ at the n-th and (n-1)-th sampling points, and $\mathrm{sgn}[x_{k'}(n)]$, $\mathrm{sgn}[x_{k'}(n-1)]$ are the sign functions of $x_{k'}(n)$ and $x_{k'}(n-1)$:

$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \ge 0 \\ -1, & \lambda < 0 \end{cases}$$

where λ is the argument of the sign function;
Step 1.5: Determine the short-time energy threshold $t_E$ and the short-time zero-crossing rate threshold $t_Z$:

$$t_E = \frac{1}{K'}\sum_{k'=1}^{K'} E_{k'}, \qquad t_Z = \frac{0.1}{K'}\sum_{k'=1}^{K'} Z_{k'}$$

where K' is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level decision with the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold $t_E$ as a first-level valid speech frame, take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set, and the first-level valid speech frame with the largest frame number as the end frame of the current valid speech frame set;
Then make a second-level decision with the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in decreasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame; and starting from the end frame and proceeding frame by frame in increasing frame-number order, likewise mark every windowed speech frame whose short-time zero-crossing rate exceeds $t_Z$ as a valid speech frame;
The valid speech frame set obtained after the two-level decision is denoted $\{p_k\}_{1\le k\le K}$, where k is the valid-frame index, K is the total number of valid speech frames, and $p_k$ is the k-th valid speech frame in the set.
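A compact sketch of the preprocessing chain of steps 1.1 to 1.6 (our own code; the way the zero-crossing-rate decision extends the energy-based endpoints frame by frame follows our reading of step 1.6 and stops at the first frame that fails the threshold):

```python
# Sketch: pre-emphasis (0.9375), 256-point frames with 128-point shift at 16 kHz,
# Hamming windowing, and the two-level energy / zero-crossing-rate endpoint decision.
import numpy as np

def preprocess(x, frame_len=256, frame_shift=128):
    x = np.asarray(x, dtype=float)
    emph = np.append(x[:1], x[1:] - 0.9375 * x[:-1])              # step 1.1, X(-1) = 0
    if len(emph) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(emph) - frame_len) // frame_shift
    idx = np.arange(frame_len) + frame_shift * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)                    # steps 1.2-1.3
    energy = (frames ** 2).sum(axis=1)                            # step 1.4
    zcr = 0.5 * np.abs(np.diff(np.where(frames >= 0, 1, -1), axis=1)).sum(axis=1)
    t_e, t_z = energy.mean(), 0.1 * zcr.mean()                    # step 1.5
    voiced = energy > t_e                                         # first-level decision
    if not voiced.any():
        return frames[:0]
    start = int(np.argmax(voiced))
    end = len(voiced) - 1 - int(np.argmax(voiced[::-1]))
    while start > 0 and zcr[start - 1] > t_z:                     # second-level decision
        start -= 1
    while end < len(voiced) - 1 and zcr[end + 1] > t_z:
        end += 1
    return frames[start:end + 1]                                  # valid speech frames
```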
Further, in the method of the present invention, the feature vector $d_i$ in step 5 is extracted as follows:
The short-time features at the speech-frame level, together with their first-order and second-order differences, are used as low-level descriptors, and the statement-level feature $d_i$ is obtained by computing statistics of the low-level descriptors of the utterance.
The statement-level statistical features of an utterance are obtained by computing statistics over all of its short-time features, with the frame-level short-time features (such as fundamental frequency, frame energy, Mel-frequency cepstral coefficients, and the wavelet packet cepstral coefficient features proposed herein) serving as low-level descriptors (LLDs).
The statistics commonly used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 Mel-frequency bands, 13th-order Mel-frequency cepstral coefficients, the positions of the maximum and minimum of the Mel correlation spectrum, and the 90%, 75%, 50%, 25% Mel spectrum roll-off points.
Beneficial effects: Compared with the prior art, the present invention has the following advantages:
Existing speech emotion recognition methods do not account for the covariate shift that exists between training samples and test samples in practical applications, so the performance of speech emotion recognition in practice is worse than under laboratory conditions. The present invention establishes an importance-weight coefficient model that explicitly considers the differences between the test samples and the training samples encountered in practice: the covariate shift between training and test samples is quantified, and the computed importance-weight coefficient β is that quantized value, which directly reflects the deviation between the training and test samples. During subsequent speech emotion feature extraction and classifier construction, the deviation can be compensated through the covariate-shift quantity β, which largely removes the influence of the recording environment on speech emotion recognition. Compared with other deviation-compensation methods for speech emotion recognition, establishing the importance-weight coefficient model to quantify the deviation between training and test samples reduces the computational complexity and difficulty of covariate-shift compensation.
Building on the importance-weight coefficient model, the deviation between training samples and test samples is compensated in the SVM classifier by introducing the importance-weight coefficients. Compared with other SVM-based recognition methods, this method introduces the importance weights into the objective function of the classical SVM classifier, which is equivalent to using a non-fixed penalty factor: samples with larger importance weights receive a larger penalty coefficient, so the separating hyperplane is adjusted accordingly. This reduces the influence of environmental factors on speech emotion recognition, improves the accuracy and stability of speech emotion recognition in practical applications, and gives better classification performance than standard SVMs.
Description of the drawings
Fig. 1 is the training flow chart of the importance-weighted SVM of the present invention.
Fig. 2 is the importance-weight estimation flow chart of the present invention.
Specific embodiment
The present invention is further illustrated below with reference to the embodiments and the accompanying drawings.
The speech emotion recognition method based on an importance-weighted support vector machine classifier of the present invention comprises the following steps:
Step 1: Preprocess the input sample set to obtain the preprocessed training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$, and randomly select b template points $\{c_l\}_{l=1}^{b}$ from the preprocessed test sample set, where $s_i^{tr}$ is a sample in the preprocessed training sample set, $s_j^{te}$ is a sample in the preprocessed test sample set, $n_{tr}$ and $n_{te}$ are the numbers of training and test samples, $c_l$ is a template point randomly selected from $\{s_j^{te}\}$, i is the training-sample index, j is the test-sample index, and l is the index of a randomly selected template point.
The preprocessing specifically includes the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal $\bar{X}$:

$$\bar{X}(\bar{n}) = X(\bar{n}) - 0.9375\,X(\bar{n}-1), \qquad 0 \le \bar{n} \le \bar{N}-1$$

where $\bar{n}$ is the discrete-point index of the digital speech signal X, $\bar{N}$ is the length of X, $X(\bar{n})$ and $X(\bar{n}-1)$ are the values of X at the $\bar{n}$-th and $(\bar{n}-1)$-th discrete points, $\bar{X}(\bar{n})$ is the value of the pre-emphasized speech signal at the $\bar{n}$-th discrete point, and X(-1) = 0;
Step 1.2: Frame the pre-emphasized speech signal $\bar{X}$ with the overlapping segmentation method. The distance between the starting points of two consecutive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate $F_s$ = 16 kHz, and the frame length is 16 ms, i.e. 256 points. Framing $\bar{X}$ yields the speech frame set $\{\bar{x}_{k'}\}_{1\le k'\le K'}$, in which the n-th discrete point of the k'-th speech frame is

$$\bar{x}_{k'}(n) = \bar{X}(n + 128(k'-1)), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $\bar{x}_{k'}$ is the k'-th speech frame in the speech frame set, n is the discrete-point index within a frame, k' is the frame index, and K' is the total number of speech frames, determined from $\bar{N}$, the frame length and the frame shift by rounding down ($\lfloor\cdot\rfloor$ denotes rounding down);
Step 1.3: Apply a Hamming window w of length 256 points to each speech frame $\bar{x}_{k'}$, 1 ≤ k' ≤ K', to obtain the windowed speech frame $x_{k'}$:

$$x_{k'}(n) = \bar{x}_{k'}(n)\,w(n), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$

where $x_{k'}(n)$, $\bar{x}_{k'}(n)$ and w(n) are the values of $x_{k'}$, $\bar{x}_{k'}$ and w at the n-th discrete point, and the 256-point Hamming window function is

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{255}\right), \qquad 0 \le n \le 255$$
Endpoint detection is then completed using the well-known energy / zero-crossing-rate double-threshold decision method, the specific steps being as follows:
Step 1.4: For each windowed speech frame $x_{k'}$, 1 ≤ k' ≤ K', compute the short-time energy $E_{k'}$ and the short-time zero-crossing rate $Z_{k'}$:

$$E_{k'} = \sum_{n=0}^{255} x_{k'}^2(n), \qquad Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \left|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\right|$$

where $E_{k'}$ is the short-time energy of the windowed speech frame $x_{k'}$, $Z_{k'}$ is its short-time zero-crossing rate, $x_{k'}(n)$ and $x_{k'}(n-1)$ are the values of $x_{k'}$ at the n-th and (n-1)-th sampling points, and $\mathrm{sgn}[\cdot]$ is the sign function, equal to 1 for a non-negative argument and -1 otherwise;
Step 1.5: Determine the short-time energy threshold as the mean short-time energy, $t_E = \frac{1}{K'}\sum_{k'=1}^{K'}E_{k'}$, and the short-time zero-crossing rate threshold as one tenth of the mean short-time zero-crossing rate, $t_Z = \frac{0.1}{K'}\sum_{k'=1}^{K'}Z_{k'}$, where K' is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level decision with the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold $t_E$ as a first-level valid speech frame, take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set and the one with the largest frame number as its end frame; then make a second-level decision with the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in decreasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame, and starting from the end frame and proceeding frame by frame in increasing frame-number order, likewise mark every windowed speech frame whose short-time zero-crossing rate exceeds $t_Z$ as a valid speech frame; the valid speech frame set obtained after the two-level decision is denoted $\{s_k\}_{1\le k\le K}$, where k is the valid-frame index, K is the total number of valid speech frames, and $s_k$ is the k-th valid speech frame in the set.
Step 2: Compute the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions.
The closeness of the distributions of the training sample data and the test sample data can be represented by the importance weight β(s):

$$\beta(s) = \frac{p_{te}(s)}{p_{tr}(s)}$$

where $p_{tr}(s)$ is the distribution density of the preprocessed training sample set $\{s_i^{tr}\}$ and $p_{te}(s)$ is the distribution density of the preprocessed test sample set $\{s_j^{te}\}$.
Step 2.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 2.2: Compute the pre-compensation parameter vector α:
β(s) is modeled by the linear model

$$\hat{\beta}(s) = \sum_{l=1}^{b}\alpha_l\,\varphi_l(s)$$

where α = (α₁, α₂, ..., α_b)', the $\varphi_l(s)$ are basis functions, s ∈ D, l = 1, 2, ..., b, and b and the basis functions can be determined from the samples $\{s_i^{tr}\}$ and $\{s_j^{te}\}$.
The squared-error criterion $J_0(\alpha)$ of the model is

$$J_0(\alpha) = \frac{1}{2}\int \left(\hat{\beta}(s) - \beta(s)\right)^2 p_{tr}(s)\,ds = \frac{1}{2}\int \hat{\beta}(s)^2 p_{tr}(s)\,ds - \int \hat{\beta}(s)\,p_{te}(s)\,ds + \frac{1}{2}\int \beta(s)\,p_{te}(s)\,ds$$

The last term of the above formula is a constant and can be ignored; the first two terms are denoted J(α):

$$J(\alpha) = \frac{1}{2}\alpha' H \alpha - h'\alpha$$

where α' is the transpose of α, H is a b × b matrix with elements $H_{l,l'} = \int \varphi_l(s)\varphi_{l'}(s)\,p_{tr}(s)\,ds$, and h is a b-dimensional vector with elements $h_l = \int \varphi_l(s)\,p_{te}(s)\,ds$.
The expectations in J(α) are approximated by sample averages, giving the approximated variance expectation of the importance weights $\hat{J}(\alpha)$:

$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha$$

where $\hat{H}$ is a b × b matrix with elements $\hat{H}_{l,l'} = \frac{1}{n_{tr}}\sum_{i=1}^{n_{tr}}\varphi_l(s_i^{tr})\varphi_{l'}(s_i^{tr})$, $\hat{h}$ is a b-dimensional vector with elements $\hat{h}_l = \frac{1}{n_{te}}\sum_{j=1}^{n_{te}}\varphi_l(s_j^{te})$, and $\hat{h}'$ is the transpose of $\hat{h}$.
Taking the non-negativity of the importance weight β(s) into account, this is converted into the optimization problem

$$\min_{\alpha}\ \hat{J}(\alpha) \qquad \text{subject to} \qquad \alpha \ge 0$$

Solving this optimization problem, the parameter vector α is its optimal solution.
When computing $\hat{H}$ and $\hat{h}$, $\varphi_l(s)$ is taken as a Gaussian kernel function with kernel width σ:

$$\varphi_l(s) = \exp\left(-\frac{\|s-c_l\|^2}{2\sigma^2}\right)$$

Substituting $\varphi_l(s)$ into $\hat{H}$ and $\hat{h}$ gives

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right)$$

where $\hat{H}_{l,l'}$ is an element of $\hat{H}$, $\hat{h}_l$ is an element of $\hat{h}$, l, l' = 1, 2, ..., b, $c_{l'}$ is a template point randomly selected from $\{s_j^{te}\}$, l' is the index of a randomly selected template point, and σ is one of the preset values.
Step 2.3: Select the optimal basis-function Gaussian kernel width $\hat{\sigma}$ by cross-validation:
Divide the preprocessed training sample set $\{s_i^{tr}\}$ and test sample set $\{s_j^{te}\}$ into R subsets $\{S_r^{tr}\}_{r=1}^{R}$ and $\{S_r^{te}\}_{r=1}^{R}$ respectively, and compute

$$\hat{J}_r^{(CV)} = \frac{1}{2|S_r^{tr}|}\sum_{s^{tr}\in S_r^{tr}} \hat{\beta}_r(s^{tr})^2 - \frac{1}{|S_r^{te}|}\sum_{s^{te}\in S_r^{te}} \hat{\beta}_r(s^{te})$$

where $\hat{J}_r^{(CV)}$ is the approximated variance expectation of the importance weights for the r-th fold of the cross-validation, r = 1, 2, ..., R, $S_r^{tr}$ is the r-th training sample subset, $S_r^{te}$ is the r-th test sample subset, $|S_r^{tr}|$ and $|S_r^{te}|$ are their sample counts, $s^{tr}$ is a sample in $S_r^{tr}$, $s^{te}$ is a sample in $S_r^{te}$, and $\hat{\beta}_r(s^{tr})$ and $\hat{\beta}_r(s^{te})$ are the importance-weight estimates of $s^{tr}$ and $s^{te}$.
The cross-validated approximated variance expectation of the importance weights is

$$\hat{J}^{(CV)} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}_r^{(CV)}, \qquad r = 1, 2, \ldots, R.$$

The optimal basis-function Gaussian kernel width $\hat{\sigma}$ is the value that minimizes $\hat{J}^{(CV)}$ over the preset values σ = 0.1, 0.2, ..., 1.
Step 3: Compute the optimal parameter vector $\hat{\alpha}$:
Using the Gaussian basis functions obtained in step 2 and the optimal basis-function Gaussian kernel width $\hat{\sigma}$, recompute $\hat{H}$ and $\hat{h}$ as

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right)$$

where l, l' = 1, 2, ..., b;
With these $\hat{H}$ and $\hat{h}$, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$ under the constraint α ≥ 0 to obtain the optimal parameter vector $\hat{\alpha}$.
Step 4: Compute the approximate importance weights:
From step 2, β(s) is modeled by the linear model $\hat{\beta}(s)=\sum_{l=1}^{b}\hat{\alpha}_l\varphi_l(s)$; substituting the Gaussian basis functions gives

$$\beta(s) = \sum_{l=1}^{b}\hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right)$$

where $\hat{\alpha}_l$ is an element of the vector $\hat{\alpha}$, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points.
Step 5: Establish the importance-weighted SVM classifier model:
The importance weights are added as coefficients on the slack variables ξ of the standard SVM classifier:

$$\min_{w}\ \frac{1}{2}|w|^2 + C\sum_{i=1}^{L}\beta_i\xi_i \qquad (13)$$

with the constraints $y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i$, $\xi_i \ge 0$, $1 \le i \le L$, where w is the normal vector of the separating hyperplane, |w| is its norm, ξ is the slack variable, C is the penalty parameter, $d_i$ is the feature vector extracted from the training sample $s_i^{tr}$, $y_i \in \{+1, -1\}$ is the class label, together they form the training samples $(d_1, y_1), (d_2, y_2), \ldots, (d_L, y_L)$, and $\beta_i$ is the importance weight of the training sample point $(d_i, y_i)$.
The statement-level statistical features of an utterance are obtained by computing statistics over all of its short-time features, with the frame-level short-time features (such as fundamental frequency, frame energy, Mel-frequency cepstral coefficients, and the wavelet packet cepstral coefficient features proposed herein) serving as low-level descriptors (LLDs).
The statistics commonly used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 Mel-frequency bands, 13th-order Mel-frequency cepstral coefficients, the positions of the maximum and minimum of the Mel correlation spectrum, and the 90%, 75%, 50%, 25% Mel spectrum roll-off points. Formula (13) and its constraints constitute the importance-weighted SVM classifier model.
The above embodiment is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and equivalent substitutions without departing from the principles of the present invention, and the technical solutions obtained by applying such improvements and equivalent substitutions to the claims of the present invention all fall within the protection scope of the present invention.

Claims (3)

1. A speech emotion recognition method based on an importance-weighted support vector machine classifier, characterized in that the method comprises the following steps:
Step 1: Preprocess the input speech signal and extract the feature vector $d_i$;
Step 2: Divide the input sample set into a training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and a test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$, and randomly select b template points $c_l$ from the test sample set to form $\{c_l\}_{l=1}^{b}$, where $s_i^{tr}$ is a sample in the training sample set, $s_j^{te}$ is a sample in the test sample set, $n_{tr}$ is the number of training samples, $n_{te}$ is the number of test samples, i is the training-sample index, j is the test-sample index, and l is the index of a randomly selected template point;
Step 3: Compute the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions, the detailed procedure being as follows:
Step 3.1: Set the candidate basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 3.2: Compute the pre-compensation parameter vector α according to the following procedure:
Step 3.2.1: Compute $\hat{H}_{l,l'}$ according to the following formula and build the b × b matrix $\hat{H}$ with $\hat{H}_{l,l'}$ as its elements:
$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\sigma^2}\right) \qquad (1)$$
where $\hat{H}$ is a b × b matrix, $\hat{H}_{l,l'}$ is its element, l, l' = 1, 2, ..., b, $c_{l'}$ is a point of the randomly selected template set $\{c_l\}_{l=1}^{b}$, and l' is the index of a randomly selected template point;
Step 3.2.2: Compute $\hat{h}_l$ according to the following formula and build the b-dimensional vector $\hat{h}$ with $\hat{h}_l$ as its elements:
$$\hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (2)$$
where $\hat{h}$ is a b-dimensional vector and $\hat{h}_l$ is its element;
Step 3.2.3: Compute the pre-compensation parameter vector α:
With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$, i.e. find the value of the parameter vector α that minimizes the following formula:
$$\hat{J}(\alpha) = \frac{1}{2}\alpha'\hat{H}\alpha - \hat{h}'\alpha \qquad (3)$$
where $\hat{J}(\alpha)$ is the approximated variance expectation of the importance weights, α' is the transpose of α, and $\hat{h}'$ is the transpose of $\hat{h}$;
Step 3.3: Select the optimal Gaussian kernel width $\hat{\sigma}$ of the basis functions by cross-validation:
Divide the training sample set $\{s_i^{tr}\}_{i=1}^{n_{tr}}$ and the test sample set $\{s_j^{te}\}_{j=1}^{n_{te}}$ into R subsets $\{S_r^{tr}\}_{r=1}^{R}$ and $\{S_r^{te}\}_{r=1}^{R}$ respectively, and compute the approximated variance expectation of the importance weights for the r-th fold of the cross-validation according to the following formula:
$$\hat{J}_r^{(CV)} = \frac{1}{2|S_r^{tr}|}\sum_{s^{tr}\in S_r^{tr}} \hat{\beta}_r(s^{tr})^2 - \frac{1}{|S_r^{te}|}\sum_{s^{te}\in S_r^{te}} \hat{\beta}_r(s^{te}) \qquad (4)$$
where $\hat{J}_r^{(CV)}$ is the approximated variance expectation of the importance weights for the r-th fold, r = 1, 2, ..., R, $S_r^{tr}$ is the r-th training sample subset, $S_r^{te}$ is the r-th test sample subset, $|S_r^{tr}|$ is the number of samples in $S_r^{tr}$, $|S_r^{te}|$ is the number of samples in $S_r^{te}$, $s^{tr}$ is a sample in $S_r^{tr}$, $s^{te}$ is a sample in $S_r^{te}$, and $\hat{\beta}_r(s^{tr})$ and $\hat{\beta}_r(s^{te})$ are the importance-weight estimates of $s^{tr}$ and $s^{te}$, computed by the following formulas:
$$\hat{\beta}_r(s^{te}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{te}-c_l\|^2}{2\sigma^2}\right) \qquad (5)$$

$$\hat{\beta}_r(s^{tr}) = \sum_{l=1}^{b}\alpha_l \exp\left(-\frac{\|s^{tr}-c_l\|^2}{2\sigma^2}\right) \qquad (6)$$
where $\alpha_l$ is the l-th element of the pre-compensation parameter vector α computed in step 3.2.3;
Substitute each of the 10 preset σ values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 into the following formula to compute the cross-validated approximated variance expectation of the importance weights $\hat{J}^{(CV)}$, and take the σ with the smallest $\hat{J}^{(CV)}$ as the optimal basis-function Gaussian kernel width $\hat{\sigma}$:
$$\hat{J}^{(CV)} = \frac{1}{R}\sum_{r=1}^{R} \hat{J}_r^{(CV)} \qquad (7)$$
where r = 1, 2, ..., R;
Step 4: With α ≥ 0 as the constraint, solve the optimization problem $\min_{\alpha}\hat{J}(\alpha)$ to obtain the optimal parameter vector $\hat{\alpha}$, where

$$\hat{H}_{l,l'} = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp\left(-\frac{\|s_i^{tr}-c_l\|^2 + \|s_i^{tr}-c_{l'}\|^2}{2\hat{\sigma}^2}\right), \qquad \hat{h}_l = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \exp\left(-\frac{\|s_j^{te}-c_l\|^2}{2\hat{\sigma}^2}\right),$$

l, l' = 1, 2, ..., b, $\hat{H}_{l,l'}$ is the element in row l and column l' of the matrix $\hat{H}$, and $\hat{h}_l$ is the l-th element of the vector $\hat{h}$;
Step 5: Compute the importance weight β(s) by the following formula:
$$\beta(s) = \sum_{l=1}^{b}\hat{\alpha}_l \exp\left(-\frac{\|s-c_l\|^2}{2\hat{\sigma}^2}\right) \qquad (8)$$
where $\hat{\alpha}_l$ is an element of the optimal parameter vector $\hat{\alpha}$, s is a sample among the training and test sample points, s ∈ D, and D is the set of training and test sample points;
Step 6: Establish the importance-weighted SVM classifier:
Add the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier to obtain the following SVM classifier expression:
$$\min_{w}\ \frac{1}{2}|w|^2 + C\sum_{i=1}^{L}\beta_i\xi_i \qquad (9)$$
This SVM classifier expression, together with the following constraints, constitutes the importance-weighted SVM classifier:

$$y_i(\langle w, d_i\rangle + b) \ge 1-\xi_i, \qquad \xi_i \ge 0, \qquad 1 \le i \le L$$

where w is the normal vector of the separating hyperplane, |w| is its norm, C is the penalty parameter, $d_i$ is the feature vector extracted from the preprocessed training sample $s_i^{tr}$, $y_i \in \{+1, -1\}$ is the class label, together they form the training samples $(d_1, y_1), (d_2, y_2), \ldots, (d_L, y_L)$, $\beta_i$ is the importance weight of the training sample point $(d_i, y_i)$, and $\xi_i$ is the slack variable of the training sample point $(d_i, y_i)$;
Step 7: Perform speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier established in step 6.
2. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1, characterized in that the preprocessing in step 1 comprises the following steps:
Step 1.1: Apply pre-emphasis to the digital speech signal X according to the following formula to obtain the pre-emphasized speech signal $\bar{X}$:
$$\bar{X}(\bar{n}) = X(\bar{n}) - 0.9375\,X(\bar{n}-1), \qquad 0 \le \bar{n} \le \bar{N}-1$$
where $\bar{n}$ is the discrete-point index of the digital speech signal X, $\bar{N}$ is the length of X, $X(\bar{n})$ and $X(\bar{n}-1)$ are the values of X at the $\bar{n}$-th and $(\bar{n}-1)$-th discrete points, $\bar{X}(\bar{n})$ is the value of the pre-emphasized speech signal at the $\bar{n}$-th discrete point, and X(-1) = 0;
Step 1.2: Frame the pre-emphasized speech signal $\bar{X}$ with the overlapping segmentation method, the distance between the starting points of two consecutive frames being called the frame shift; here the frame shift is 8 ms, i.e. 128 points at a sampling rate $F_s$ = 16 kHz, and the frame length is 16 ms, i.e. 256 points; framing $\bar{X}$ yields the speech frame set $\{\bar{x}_{k'}\}_{1\le k'\le K'}$, in which the n-th discrete point of the k'-th speech frame is:
$$\bar{x}_{k'}(n) = \bar{X}(n + 128(k'-1)), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$
where $\bar{x}_{k'}$ is the k'-th speech frame in the speech frame set, n is the discrete-point index within a frame, k' is the frame index, and K' is the total number of speech frames, determined from $\bar{N}$, the frame length and the frame shift by rounding down ($\lfloor\cdot\rfloor$ denotes rounding down);
Step 1.3: Apply a Hamming window w of length 256 points to each speech frame $\bar{x}_{k'}$, 1 ≤ k' ≤ K', to obtain the windowed speech frame $x_{k'}$:
$$x_{k'}(n) = \bar{x}_{k'}(n)\,w(n), \qquad 0 \le n \le 255,\ 1 \le k' \le K'$$
where $x_{k'}(n)$, $\bar{x}_{k'}(n)$ and w(n) are the values of $x_{k'}$, $\bar{x}_{k'}$ and w at the n-th discrete point, and the 256-point Hamming window function is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{255}\right), \qquad 0 \le n \le 255$$
Step 1.4: For each windowed speech frame $x_{k'}$, 1 ≤ k' ≤ K', compute the short-time energy $E_{k'}$ and the short-time zero-crossing rate $Z_{k'}$:
$$E_{k'} = \sum_{n=0}^{255} x_{k'}^2(n), \qquad 1 \le k' \le K'$$

$$Z_{k'} = \frac{1}{2}\sum_{n=1}^{255} \left|\,\mathrm{sgn}[x_{k'}(n)] - \mathrm{sgn}[x_{k'}(n-1)]\,\right|$$
where $E_{k'}$ is the short-time energy of the windowed speech frame $x_{k'}$, $Z_{k'}$ is its short-time zero-crossing rate, $x_{k'}(n)$ and $x_{k'}(n-1)$ are the values of $x_{k'}$ at the n-th and (n-1)-th sampling points, and $\mathrm{sgn}[x_{k'}(n)]$, $\mathrm{sgn}[x_{k'}(n-1)]$ are the sign functions of $x_{k'}(n)$ and $x_{k'}(n-1)$, i.e.:
$$\mathrm{sgn}[\lambda] = \begin{cases} 1, & \lambda \ge 0 \\ -1, & \lambda < 0 \end{cases}$$
where λ is the argument of the above sign function;
Step 1.5: Determine the short-time energy threshold $t_E$ and the short-time zero-crossing rate threshold $t_Z$:
$$t_E = \frac{1}{K'}\sum_{k'=1}^{K'} E_{k'}$$

$$t_Z = \frac{0.1}{K'}\sum_{k'=1}^{K'} Z_{k'}$$
where K' is the total number of speech frames;
Step 1.6: For each windowed speech frame, first make a first-level decision with the short-time energy: mark every windowed speech frame whose short-time energy exceeds the threshold $t_E$ as a first-level valid speech frame, take the first-level valid speech frame with the smallest frame number as the start frame of the current valid speech frame set, and the first-level valid speech frame with the largest frame number as the end frame of the current valid speech frame set;
Then make a second-level decision with the short-time zero-crossing rate: for the current valid speech frame set, starting from the start frame and proceeding frame by frame in decreasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame; and starting from the end frame and proceeding frame by frame in increasing frame-number order, mark every windowed speech frame whose short-time zero-crossing rate exceeds the threshold $t_Z$ as a valid speech frame;
The valid speech frame set obtained after the two-level decision is denoted $\{p_k\}_{1\le k\le K}$, where k is the valid-frame index, K is the total number of valid speech frames, and $p_k$ is the k-th valid speech frame in the set.
3. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1 or 2, characterized in that the feature vector $d_i$ in step 5 is extracted as follows:
The short-time features at the speech-frame level, together with their first-order and second-order differences, are used as low-level descriptors, and the statement-level feature $d_i$ is obtained by computing statistics of the low-level descriptors of the utterance.
CN201610969948.7A 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier Active CN106504772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610969948.7A CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610969948.7A CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Publications (2)

Publication Number Publication Date
CN106504772A true CN106504772A (en) 2017-03-15
CN106504772B CN106504772B (en) 2019-08-20

Family

ID=58322831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610969948.7A Active CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Country Status (1)

Country Link
CN (1) CN106504772B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080077720A (en) * 2007-02-21 2008-08-26 인하대학교 산학협력단 A voice activity detecting method based on a support vector machine(svm) using a posteriori snr, a priori snr and a predicted snr as a feature vector
KR20110021328A (en) * 2009-08-26 2011-03-04 인하대학교 산학협력단 The method to improve the performance of speech/music classification for 3gpp2 codec by employing svm based on discriminative weight training
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN103544963A (en) * 2013-11-07 2014-01-29 东南大学 Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN104091602A (en) * 2014-07-11 2014-10-08 电子科技大学 Speech emotion recognition method based on fuzzy support vector machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YONGMING HUANG ET AL.: "Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition", 《IET SIGNAL PROCESS》 *
QIN YUQIANG, ZHANG XUEYING: "Speech signal emotion recognition based on SVM", 《JOURNAL OF CIRCUITS AND SYSTEMS》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108735233A (en) * 2017-04-24 2018-11-02 北京理工大学 A kind of personality recognition methods and device
CN108364641A (en) * 2018-01-09 2018-08-03 东南大学 A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
WO2020024210A1 (en) * 2018-08-02 2020-02-06 深圳大学 Method and apparatus for optimizing window parameter of integrated kernel density estimator, and terminal device
CN110991238A (en) * 2019-10-30 2020-04-10 中国科学院自动化研究所南京人工智能芯片创新研究院 Speech auxiliary system based on speech emotion analysis and micro-expression recognition
CN111415680A (en) * 2020-03-26 2020-07-14 心图熵动科技(苏州)有限责任公司 Method for generating anxiety prediction model based on voice and anxiety prediction system
CN113434698A (en) * 2021-06-30 2021-09-24 华中科技大学 Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN113434698B (en) * 2021-06-30 2022-08-02 华中科技大学 Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN116801456A (en) * 2023-08-22 2023-09-22 深圳市创洺盛光电科技有限公司 Intelligent control method of LED lamp

Also Published As

Publication number Publication date
CN106504772B (en) 2019-08-20

Similar Documents

Publication Publication Date Title
CN106504772B (en) Speech-emotion recognition method based on weights of importance support vector machine classifier
CN108305616B (en) Audio scene recognition method and device based on long-time and short-time feature extraction
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN101599271B (en) Recognition method of digital music emotion
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
Nwe et al. Speech based emotion classification
CN107146601A (en) A kind of rear end i vector Enhancement Methods for Speaker Recognition System
CN106328121A (en) Chinese traditional musical instrument classification method based on depth confidence network
CN107393554A (en) In a kind of sound scene classification merge class between standard deviation feature extracting method
CN102800316A (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN107146615A (en) Audio recognition method and system based on the secondary identification of Matching Model
CN103236258B (en) Based on the speech emotional characteristic extraction method that Pasteur&#39;s distance wavelet packets decomposes
CN111210803B (en) System and method for training clone timbre and rhythm based on Bottle sock characteristics
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN109410911A (en) Artificial intelligence learning method based on speech recognition
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN108461085A (en) A kind of method for distinguishing speek person under the conditions of Short Time Speech
CN107274887A (en) Speaker&#39;s Further Feature Extraction method based on fusion feature MGFCC
CN105070300A (en) Voice emotion characteristic selection method based on speaker standardization change
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN108364641A (en) A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
CN110047504A (en) Method for distinguishing speek person under identity vector x-vector linear transformation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant