CN101620853A

CN101620853A - Speech-emotion recognition method based on improved fuzzy vector quantization

Info

Publication number: CN101620853A
Application number: CN200810122806A
Authority: CN
Inventors: 邹采荣; 赵力; 赵艳; 魏昕
Original assignee: Individual
Current assignee: Individual
Priority date: 2008-07-01
Filing date: 2008-07-01
Publication date: 2010-01-06

Abstract

The invention discloses a speech-emotion recognition method based on improved fuzzy vector quantization. The method extends the sum of fuzzy membership function from one to N so as to reduce the influence of sample wild-point on an iteration-training process to a certain extent, and adopts a clustering method based on similarity threshold and a minimum distance principle in the iteration-training process so as to avoid the problem that a clustering center is sensitive to initial values and easy to fall into local minimum values to a certain extent. Experimental results prove that the method can effectively improve the emotion recognition rate of the prior fuzzy vector quantization method.

Description

A kind of based on the speech-emotion recognition method that improves fuzzy vector quantization

Technical field

The present invention relates to a kind of audio recognition method, particularly a kind of speech emotional recognition system and method.

Background technology

The speech emotional automatic identification technology mainly comprises two problems: the one, and which kind of feature in the employing voice signal is as emotion recognition, and the problem of affective characteristics extraction just comprises feature extraction and selection; The one, how with specific voice data classification, the problem of pattern-recognition just comprises various algorithm for pattern recognitions, as arest neighbors, neural network, support vector machine etc.

The affective characteristics of using in the speech emotional identification mainly is prosodic parameter and tonequality parameter, and the former comprises duration, word speed, energy, fundamental frequency and the parameter of deriving thereof, and the latter mainly is resonance peak, harmonic noise ratio and the parameter etc. of deriving thereof.According to three-dimensional emotional space theory, prosodic parameter mainly is to characterize various emotions in the parameter that activates dimension coordinate, and the tonequality parameter then mainly is to characterize emotion at the coordinate of dimension of tiring.For activating dimension coordinate distance emotion far away, prosodic parameter can characterize out otherness preferably; For the dimension coordinate distance emotion far away of tiring activating the dimension coordinate close together, then need tonequality class parameter to strengthen the characterization parameter otherness.There is the problem that accurately detects mostly in present parameter extracting method, and these parameters mainly embody is the glottis of human body and the feature of sound channel, with people's physiological make-up confidential relation is arranged, show stronger diversity on different individualities, this species diversity is especially obvious on different sexes.Before the present invention, in existing various recognition methodss, though the neural network method has the non-linear and extremely strong classification capacity of height, along with the required learning time increase of the increase of network is very fast, the local minimum problem also is a weak point in addition; Hidden Markov method (HMM) is longer on foundation and training time, and being applied to reality also needs to solve the too high problem of computation complexity.Though quadratic discriminant algorithm simple computation amount is little, must be prerequisite with the eigenvector Normal Distribution, influenced discrimination greatly.Then because problem such as quantization error, initial value sensitivity and less use,, but still easily be absorbed in the problem of the responsive and local minimum of initial value based on the recognition methods of vector quantization though fuzzy vector quantization has been alleviated the quantization error problem to a certain extent.

Summary of the invention

Purpose of the present invention just is to overcome the defective of above-mentioned prior art, design, a kind of speech-emotion recognition method based on the improvement fuzzy vector quantization of research.

Technical scheme of the present invention is:

A kind of based on the speech-emotion recognition method that improves fuzzy vector quantization, the steps include:

Set up training, the emotion recognition module of feature extraction analysis module, feature dimensionality reduction module, improvement fuzzy vector quantization module.The feature extraction analysis module comprises that two class Parameter Extraction and sex are regular: prosodic parameter and tonequality parameter.At first, carry out feature extraction then respectively to primary speech signal pre-emphasis, branch frame.

(1) prosodic parameter extracts

(1-1) with primary speech signal through the Hi-pass filter pre-service, extract pronunciation duration, word speed parameter;

(1-2) divide frame, windowing;

(1-3) use the short-time analysis technology, extract each frame statement principal character parameter respectively: fundamental frequency track, short-time energy track, voiced segments voiceless sound section time ratio;

(1-4) parameter of deriving of extraction part prosodic features parameter: short-time energy maximal value, minimum value, average and variance, short-time energy shake maximal value, minimum value, average and variance, fundamental frequency maximal value, minimum value, average and variance, maximal value, minimum value, average and the variance of fundamental frequency shake.What wherein short-time energy was shaken is calculated as follows:

E_{i}^{1} = | E_{i}^{0} - E_{i - 1}^{0} |

I=2,3 ..., N (formula 1)

E wherein _t ⁰Be the short-time energy of i frame, N is a frame number.The calculating of fundamental frequency shake is with (formula 1).

(1-5) sex is regular, according to the different sexes under the sample, is included into different set s _iCalculate average μ separately once more respectively _iAnd variances sigma _i, represent different set numbers with i here, utilize following formula that parameter is regular to identical space;

{s_{i}}^{'} = \frac{s_{i} - u_{i}}{σ_{i}}

(formula 2)

(2) tonequality characteristic parameter extraction

(2-1) maximal value, minimum value, average and the variance of extraction glottis wave parameter, comprise: glottis opening time and whole glottis period ratio (OQ, open quotient), glottis opening process time and closing course time ratio (SQ, speed quotient), glottis closure time and whole glottis period ratio (CQ, ClosedQuotient), glottis closing course time and whole glottis period ratio (ClQ, Closing Quotient), glottis ripple skewness;

(2-2) extract harmonic noise than maximal value, minimum value, average, variance;

(2-3) extract first three resonance peak maximal value, minimum value, average, variance and bandwidth;

(2-4) extract maximal value, minimum value, average, the variance that first three resonance peak is shaken; The resonance peak Jitter Calculation is with (formula 1);

(2-5) sex is regular, with (1-5);

(3) feature dimensionality reduction

(3-1) with behind whole feature extractions and regular the finishing in (1) (2), the composition characteristic vector;

(3-2) adopt principal component analysis neural network (PCANN) to realize dimensionality reduction, obtain sample characteristics vector sequence X={X ₁, X ₂..., X _N, };

(4) improve fuzzy vector quantization

(4-1) to all training samples of certain emotion, calculate the Euclidean distance between any two samples, two nearest samples are decided to be a class, selected distance threshold values L will be judged to this type of with distance all samples within L of one of this two sample;

(4-2) sample and the distance relevant with these samples that have the classification ownership are suitably handled, do not re-used;

(4-3) in remaining sample, find nearest pair of sample, if the distance between them then is decided to be a class respectively with these two samples, and has only a sample in all kinds of greater than L; If the distance between them is less than L, selected distance threshold values α L (0＜α≤1) then will declare with distance all samples within α L of one of this sample and belong to this type of;

(4-4) repeating step (4-2), (4-3) are classified up to all samples, if last only surplus sample then is decided to be a class separately with this sample;

(4-5) adjust L and α L, gathered into the J class up to all samples;

(4-6) with membership function u _k(X _i) normalizing condition expand as

Σ_{j = 1}^{J} Σ_{i = 1}^{N} u_{j} (X_{i}) = N,

Calculate u by (formula 3) _k(X _i), calculate all kinds of class center Y by (formula 4) _j(i=1,2 ... J);

u_{k} (X_{i}) = Σ_{j = 1}^{J} Σ_{i = 1}^{N} {(\frac{d {(X_{i}, Y_{k})}^{2 / (m - 1)}}{Nd {(X_{i}, Y_{j})}^{2 / (m - 1)}})}^{- 1},

1≤k≤J, 1≤i≤N (formula 3)

Y_{k} = \frac{Σ_{i = 1}^{N} u_{k}^{m} (X_{i}) X_{i}}{Σ_{i = 1}^{N} u_{k}^{m} (X_{i})}

1≤k≤J (formula 4)

Wherein m ∈ [1, ∞) be blur level, d (X _i, Y _k) the expression distance;

(4-7) selectivity constant ε＞0 is provided with iterations k=0, as initial codebook, adopts fuzzy C average (FCM) clustering algorithm recursion to go out code book Y with the class center of (4-6) _j(i=1,2 ... J);

(4-8) every kind of emotion is trained a code book by (4-1)～(4-7);

(5) emotion recognition

(5-1) obtain eigenvector X according to step (1) (2) (3) for statement to be identified _i, X _iBe quantized into the vector U (X that forms by membership function _i)={ u ₁, (X _i), u ₂(X _i) ..., u _J(X _i), obtain X _iReconstructed vector

With quantization error D;

{\hat{X}}_{i} = Σ_{k = 1}^{J} u_{k}^{m} Y_{k} / Σ_{k = 1}^{J} u_{k}^{m}

(formula 5)

D = Σ_{k = 1}^{J} u_{k}^{m} (X_{i}) d (X_{i}, Y_{k})

(formula 6)

(5-2) selecting the emotion of that code book correspondence of average quantization distortion minimum is recognition result.

Advantage of the present invention and effect are:

1. by characteristic parameter extraction and analysis, parameter is extended to the tonequality parameter from prosodic parameter, increased the validity of characteristic parameter identification to the emotion statement;

2. adopt the isolated component neural network that the eigenvector that is extracted is carried out dimensionality reduction, not only reduced calculated amount, and played the noise reduction effect to a certain extent;

3. the fuzzy membership function normalizing condition is relaxed, reduce of the influence of wild point code book;

4. adopt clustering method training code book, avoided initial value and local minimum problem based on similarity threshold values and minimal distance principle;

By vector quantization input vector X _iBe quantized into the vector of forming by membership function, rather than certain code word Y _k, be equivalent to increase the size of code book, reduced quantization error.

Other advantages of the present invention and effect will continue to describe below.

Description of drawings

Fig. 1---speech emotional recognition system block diagram.

Fig. 2---affective characteristics extraction and analysis module process flow diagram.

Fig. 3---glottis involves its differentiated waveform figure.

Fig. 4---principal component analysis neural network synoptic diagram.

Fig. 5---the emotion recognition result of fuzzy vector quantization method compares before and after improving.

Embodiment

Below in conjunction with drawings and Examples, technical solutions according to the invention are further elaborated.

As shown in Figure 1, be the native system block diagram, mainly be divided into 4 bulks: feature extraction analysis module, feature dimensionality reduction module, fuzzy vector quantization code book training module and emotion recognition module.The total system implementation is divided into training process and identifying.Training process comprises feature extraction analysis, feature dimensionality reduction and the training of fuzzy vector quantization code book; Identifying comprises feature extraction analysis, feature dimensionality reduction and emotion recognition.

One. affective characteristics extraction and analysis module

1. the prosodic features parameter is selected

The prosodic features parameter comprises: short-time energy maximal value, minimum value, average and variance; Short-time energy shake maximal value, minimum value, average and variance; The maximal value of fundamental frequency, minimum value, average and variance; Maximal value, minimum value, average and the variance of fundamental frequency shake; Voiced segments voiceless sound section time ratio; Word speed.

At first, the characteristic parameter extraction flow process in 2 is carried out pre-emphasis with feature statement to be extracted and is handled with reference to the accompanying drawings, comprises that high-pass filtering, statement begin the detection of end points and end caps; Extract statement pronunciation duration, these two features of word speed of full sentence; Divide the frame windowing to statement then, adopt the short-time analysis technology, according to the gender, obtain each frame fundamental frequency, short-time energy, voiced sound frame number and voiceless sound frame number respectively, each frame gained parameter is gathered, obtain pitch contour, fundamental tone shake track, short-time energy track and the short-time energy shake track of statement respectively, and then obtain their characteristic statistic, and it is regular to carry out sex, obtains above-mentioned whole prosodic features parameter.

2. the tonequality characteristic parameter is selected

The tonequality characteristic parameter comprises: the maximal value of OQ, minimum value, average and variance; The maximal value of SQ, minimum value, average and variance; The maximal value of CQ, minimum value, average and variance; The maximal value of ClQ, minimum value, average and variance; R _kMaximal value, minimum value, average and variance; The first resonance peak maximal value, minimum value, average, variance and bandwidth; Maximal value, minimum value, average, the variance of the shake of first resonance peak; The second resonance peak maximal value, minimum value, average, variance and bandwidth; Maximal value, minimum value, average, the variance of the shake of second resonance peak; The 3rd resonance peak maximal value, minimum value, average, variance and bandwidth; Maximal value, minimum value, average, the variance of the shake of the 3rd resonance peak; Harmonic noise is than maximal value, minimum value, average, variance.

Choosing of a plurality of tonequality parameters is that the present invention proposes one of characteristics of method.Though prosodic features plays a leading role in identification, some activates when tieing up the emotion of separating near the dimension of tiring in identification, and as glad and angry, the tonequality feature can play effective supplementary function.The variation of glottal waveform shape when the tonequality parameter is the reflection pronunciation, its influence factor has muscle tone, sound channel central authorities' pressure and sound channel length tension force, concrete sound Source Type (articulation type), glottis wave parameter and sound channel formant parameter etc.LF model (Liljencrants-Fant Mode) is the model of the description glottis ripple used always, as shown in Figure 3, and T wherein ₀: pitch period; t _o: glottis is opened constantly; t _c: glottis closing moment; t _p: the glottis ripple reaches peak-peak constantly; t _e: the difference ripple reaches maximum negative peak constantly.Can extract following glottis wave parameter according to this model:

OQ = \frac{t_{c} - t_{o}}{T_{0}}

(formula 7)

SQ = \frac{t_{p} - t_{o}}{t_{c} - t_{p}}

(formula 8)

CQ = \frac{T_{0} - t_{c} + t_{o}}{T_{0}} = 1 - OQ

(formula 9)

CIQ = \frac{t_{c} - t_{p}}{T_{0}}

(formula 10)

R_{k} = \frac{t_{p}}{t_{e} - t_{p}}

(formula 11)

During concrete enforcement, need that still the emotion statement is carried out pre-emphasis and handle, comprise that high-pass filtering, statement begin the detection of end points and end caps; Divide the frame windowing to statement then, obtain tonequality parameters such as glottis wave characteristic, resonance peak feature, harmonic noise compare respectively, and it is regular to carry out sex, finally be used for the tonequality characteristic parameter of code book training or identification.

In the implementation of system, the feature extraction analysis is absolutely necessary.In training process, the feature extraction analysis of training sample can directly be carried out according to flow process shown in Figure 2.In identifying, the feature extraction of statement to be identified is analyzed and is carried out according to Fig. 2 flow process equally.

Two. the feature dimensionality reduction

Preceding surface analysis has extracted totally 69 characteristic parameters, for avoiding the too high computation complexity that causes of dimension to promote, and redundant information adopts the isolated component neural network to realize dimensionality reduction to the influence of identification, employing is based on the linear unsupervised learning neural network of Hebb rule, as shown in Figure 4.By study to weight matrix W, make weight vector approach the pairing proper vector of eigenwert in the oblique variance battle array of proper vector x, avoid directly inversion operation to matrix.Obtain eigenvector y=W behind the dimensionality reduction ^TX.It is as follows that weight vector is revised rule:

w _j[k+1]=w _j[k]+η (y _j[k] x ' [k]-y _j ²[k] w _j[k]) (formula 12)

x^{'} [k] = x [k] - Σ_{i = 1}^{j - 1} w_{i} [k] y_{i} [k]

(formula 13)

Three. improve the training of fuzzy vector quantization code book

The traditional fuzzy vector quantization is to adopt the fuzzy clustering algorithm to replace the design of K mean algorithm to quantize a kind of method of code book, can reduce the quantization error of code book to a certain extent, wild point disturbs, initial value is responsive and local minimization problem but still exist, for this reason, the present invention proposes a kind of improvement fuzzy vector quantization method, and concrete steps are as follows:

1. to all training characteristics samples of a certain emotion, calculate the Euclidean distance between any two samples, two nearest samples are decided to be a class, selected distance threshold values L will be judged to this type of with distance all samples within L of one of this two sample;

2. the sample and the distance relevant with these samples that have the classification ownership are suitably handled, do not re-used;

3. in remaining sample, find nearest pair of sample, if the distance between them then is decided to be a class respectively with these two samples, and has only a sample in all kinds of greater than L; If the distance between them is less than L, selected distance threshold values α L (0＜α≤1) then will declare with distance all samples within α L of one of this sample and belong to this type of;

4. repeating step 2,3, all are classified up to all samples, if last only surplus sample then is decided to be a class separately with this sample;

5. adjust L and α L, gathered into the J class up to all samples;

6. calculate membership function u according to (formula 3) _k(X _i), with u _k(X _i) normalizing condition expand as

Σ_{j = 1}^{J} Σ_{i = 1}^{N} u_{j} (X_{i}) = N,

This also is one of characteristics of the present invention, and calculates all kinds of class center Y by (formula 4) _j(i=1,2 ... J);

7. selectivity constant ε＞0 is provided with iterations k=0, as initial codebook, adopts fuzzy C mean algorithm recursion to go out code book Y with result in 6 _j(i=1,2 ... J);

8. every kind of emotion is trained a code book respectively by 1～7.

Four. the emotion recognition module

Emotion statement for to be identified extracts its eigenvector according to Fig. 2 flow process, utilizes the principal component analysis neural network to carry out dimensionality reduction then, obtains X _iWith X _iThe code book of corresponding every kind of emotion carries out vector quantization, X _iBe quantized into the vector U (X that forms by membership function _i)={ u ₁, (X _i), u ₂(X _i) ..., u _J(X _i), obtain X _iReconstructed vector

With quantization error D; Selecting the emotion of that code book correspondence of average quantization distortion minimum is recognition result.

Five. the evaluation of recognition system

Because it is N that the degree of membership summation is expanded by 1, reduced the influence of sample wild-point to a certain extent to the training iterative process, in code book instruction amount process, adopt clustering method based on similarity threshold values and minimal distance principle, avoided to a certain extent cluster centre to the initial value sensitivity, easily be absorbed in the problem of local minimum, result from two kinds of emotion identification methods of Fig. 5, its recognition effect obtains bigger improvement, angry discrimination has improved 12.3%, sad discrimination has improved 5.1%, glad discrimination has improved 5.9%, surprised discrimination has improved 14.9%, and the inventive method is discerned speech emotional and is much higher than existing additive method.

The scope that the present invention asks for protection is not limited only to the description of this embodiment.

Claims

1. the speech-emotion recognition method based on the improvement fuzzy vector quantization the steps include:

Set up training, the emotion recognition module of feature extraction analysis module, feature dimensionality reduction module, improvement fuzzy vector quantization module; The feature extraction analysis module comprises that two class Parameter Extraction and sex are regular: prosodic parameter and tonequality parameter; At first, carry out feature extraction then respectively to primary speech signal pre-emphasis, branch frame;

(1) prosodic parameter extracts

(1-2) divide frame, windowing;

(1-4) parameter of deriving of extraction part prosodic features parameter: short-time energy maximal value, minimum value, average and variance, short-time energy shake maximal value, minimum value, average and variance, fundamental frequency maximal value, minimum value, average and variance, maximal value, minimum value, average and the variance of fundamental frequency shake; What wherein short-time energy was shaken is calculated as follows:

E_{i}^{1} = | E_{i}^{0} - E_{i - 1}^{0} |

I=2,3 ..., N (formula 1)

E wherein _i ⁰Be the short-time energy of i frame, N is a frame number; The calculating of fundamental frequency shake is with (formula 1);

{s_{i}}^{'} = \frac{s_{i} - u_{i}}{σ_{i}}

(formula 2)

(2) tonequality characteristic parameter extraction

(2-5) sex is regular, with (1-5);

(3) feature dimensionality reduction

(4) improve fuzzy vector quantization

(4-5) adjust L and α L, gathered into the J class up to all samples;

(4-6) with membership function u _k(X _i) normalizing condition expand as

Σ_{j = 1}^{J} Σ_{i = 1}^{N} u_{j} (X_{i}) = N,

Calculate u by (formula 3) _k(X _i),

Calculate all kinds of class center Y by (formula 4) _j(i=1,2 ... J);

u_{k} (X_{i}) = Σ_{j = 1}^{J} Σ_{i = 1}^{N} {(\frac{d {(X_{i}, Y_{k})}^{2 / (m - 1)}}{Nd {(X_{i}, Y_{j})}^{2 / (m - 1)}})}^{- 1},

1≤k≤J, 1≤i≤N (formula 3)

Y_{k} = \frac{Σ_{i = 1}^{N} u_{k}^{m} (X_{i}) X_{i}}{Σ_{i = 1}^{N} u_{k}^{m} (X_{i})}

1≤k≤J (formula 4)

Wherein m ∈ [1, ∞) be blur level, d (X _i, Y _k) the expression distance;

(4-8) every kind of emotion is trained a code book by (4-1)～(4-7);

(5) emotion recognition

With quantization error D;

{\hat{X}}_{i} = Σ_{k = 1}^{J} u_{k}^{m} Y_{k} / Σ_{k = 1}^{J} u_{k}^{m}

(formula 5)

D = Σ_{k = 1}^{J} u_{k}^{m} (X_{i}) d (X_{i}, Y_{k})

(formula 6)