CN103021406B - Robust speech emotion recognition method based on compressive sensing

Publication number: CN103021406B (grant); CN103021406A (application publication)
Application number: CN201210551585.7A, filed 2012-12-18
Inventors: 赵小明 (Zhao Xiaoming), 张石清 (Zhang Shiqing)
Applicant/Assignee: Taizhou University
Other languages: Chinese (zh)
Legal status: Expired - Fee Related

Abstract

The invention discloses a robust speech emotion recognition method based on compressive sensing. The method comprises generating noisy emotional speech samples, establishing an acoustic feature extraction module, constructing a sparse representation classifier model, and outputting the speech emotion recognition result. Its advantages are that it fully considers the effect of noise on emotional speech in natural environments and provides a robust speech emotion recognition method for noisy backgrounds; it fully considers the validity of different types of feature parameters, extending feature extraction from prosodic and voice-quality features to the Mel-frequency cepstral coefficients (MFCC), which further improves the noise robustness of the feature parameters; and it exploits the discriminative power of sparse representation in compressive sensing theory to provide a high-performance robust speech emotion recognition method.

Description

Robust speech emotion recognition method based on compressive sensing
Technical field
The present invention relates to the fields of speech processing and pattern recognition, and in particular to a robust speech emotion recognition method based on compressive sensing.
Background art
Human speech carries not only textual and symbolic information but also information about the speaker's emotions and mood. How to let a computer automatically analyze a speech signal and judge the speaker's affective state, i.e. so-called "speech emotion recognition", has become a research focus in fields such as speech processing and pattern recognition. The ultimate goal of this research is to endow computers with emotional intelligence, so that they can interact with people as naturally, warmly and vividly as humans do. The research has important application value in artificial intelligence, robotics, and natural human-computer interaction.
At present, research on speech emotion recognition essentially uses emotional corpora recorded in quiet environments as the object of analysis. Emotional speech in natural environments, however, is usually corrupted by noise of varying degrees. Research on robust speech emotion recognition under noisy backgrounds is therefore closer to reality and of greater practical value, yet the literature on this topic is still very scarce.
Automatic speech emotion recognition mainly involves two problems. The first is emotional feature extraction: which effective speech feature parameters should be extracted for emotion recognition. The second is the emotion recognition method: which effective pattern recognition method should be used to classify an utterance containing a certain emotion into its emotion category (see the patent: Zou Cairong, a speech emotion recognition method based on support vector machines, application/patent number 2006100973016).
Regarding emotional feature extraction, the feature parameters commonly used in speech emotion recognition are prosodic features and voice-quality features: the former include fundamental frequency, amplitude and utterance duration, and the latter include formants, band energy distribution, harmonics-to-noise ratio and short-term perturbation parameters. The noise robustness of these parameters is, however, very limited, so it is difficult to obtain good speech emotion recognition performance under noisy backgrounds using prosodic and voice-quality features alone. To improve noise robustness it is necessary to extract another type of feature parameter, namely spectral features, and fuse them with the prosodic and voice-quality features. A representative spectral feature is the Mel-frequency cepstral coefficient (MFCC), which reflects the characteristics of human hearing.
Regarding the recognition method, the methods that have been successfully applied to speech emotion recognition mainly include the linear discriminant classifier (LDC), k-nearest neighbors (KNN), artificial neural networks (ANN) and support vector machines (SVM). These methods are, however, rather sensitive to noise, and it is difficult for them to achieve good robust speech emotion recognition performance. It is therefore necessary to develop a new, high-performance, robust speech emotion recognition method.
Compressed sensing (CS) technology is introduced next.
Compressed sensing (CS) (see: E. J. Candes, M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 2008, 25(2): 21-30) is a new theory of signal processing and sampling. Its core idea is that, as long as a signal is compressible, or sparse in some transform domain, it can be projected onto a low-dimensional space using an observation matrix that is incoherent with the transform basis, and the original signal can then be reconstructed with high probability from this small number of projections by solving an optimization problem. Under this framework the sampling rate is no longer determined by the bandwidth of the signal but by the structure and content of the information in the signal.
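As a point of reference (not part of the patent text), the core CS idea can be illustrated with a small numerical sketch: a sparse vector is recovered from far fewer random, incoherent projections than its length by L1 minimization. The example below assumes numpy and cvxpy; the dimensions are arbitrary illustrative choices.

```python
import numpy as np
import cvxpy as cp

# Toy compressed-sensing demo: a length-200 vector with only 8 nonzeros is
# recovered from 60 random Gaussian projections by L1 minimization.
rng = np.random.default_rng(0)
n, m, k = 200, 60, 8                                  # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
Phi = rng.normal(size=(m, n)) / np.sqrt(m)            # measurement matrix, incoherent with the identity basis
y = Phi @ x_true                                      # m << n linear measurements

x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm(x, 1)), [Phi @ x == y]).solve()
print("max reconstruction error:", np.max(np.abs(x.value - x_true)))
```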
The original motivation of compressed sensing (CS) research was signal compression and representation, but the sparsest representation also has good discriminative power and can be used to build classifiers (see: Guha T, Ward R K. Learning sparse representations for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(8): 1576-1588). To date, no published work on speech emotion recognition has used the discriminative power of sparse representation in compressive sensing theory as a robust recognition method for speech emotion. The present invention uses the discriminative power of sparse representation in compressive sensing theory to realize robust speech emotion recognition under noisy backgrounds.
Summary of the invention
The object of the present invention is to overcome the shortcomings of the existing emotion recognition techniques described above and to provide a robust speech emotion recognition method based on compressed sensing, for realizing robust speech emotion recognition under noisy backgrounds.
The technical solution adopted in the present invention is:
A robust speech emotion recognition method based on compressed sensing, comprising the following steps:
generating noisy emotional speech samples, establishing an acoustic feature extraction module, constructing a sparse representation classifier model, and outputting the speech emotion recognition result;
(1) Generating the noisy emotional speech samples, comprising:
dividing all speech samples of the emotional speech sample library into a training set and a test set, and then adding white Gaussian noise to each training and test sample to produce the noisy emotional speech samples;
(2) Establishing the acoustic feature extraction module, comprising:
performing acoustic feature extraction on the noisy emotional speech samples, the module comprising three parts: prosodic feature extraction, voice-quality feature extraction, and Mel-frequency cepstral coefficient (MFCC) extraction;
(2-1) prosodic feature extraction, comprising: fundamental frequency, amplitude and utterance duration;
(2-2) voice-quality feature extraction, comprising: formants, band energy distribution, harmonics-to-noise ratio and short-term perturbation parameters;
(2-3) MFCC extraction, comprising: extracting the 13-dimensional MFCC features together with their first- and second-order derivatives, and then computing their mean and standard deviation;
(3) Constructing the sparse representation classifier model, comprising:
obtaining, through the acoustic feature extraction module, one feature vector per emotional speech sample, formed from the extracted acoustic feature parameters; the feature vectors of all emotional speech samples are input to the sparse representation classifier to build the classifier model;
the sparse representation classifier is built as follows: using the method of sparse decomposition, the test sample is sparsely represented over the training samples, i.e. the training samples are regarded as a set of bases, the sparse representation coefficients of the test sample are obtained by solving an L1-norm minimization problem, and classification is finally performed according to the residual between the test sample and its sparse reconstruction;
(4) Outputting the speech emotion recognition result, comprising:
training and testing the sparse representation classifier and outputting the speech emotion recognition result; in the emotion recognition tests a 10-fold cross-validation scheme is adopted: all utterances are divided equally into 10 parts, 9 parts are used for training and the remaining part for testing, the experiment is repeated 10 times accordingly, and the mean of the 10 runs is taken as the recognition result.
The fundamental frequency: a correlation method is used to extract the pitch contour of the emotional speech, and 10 statistical parameters of the contour are then computed, comprising the maximum, minimum, range, upper quartile, median, lower quartile, interquartile range, mean, standard deviation, and mean absolute slope;
The amplitude: obtained by a sum-of-squares method; 9 amplitude-related statistical parameters are extracted, comprising the mean, standard deviation, maximum, minimum, range, upper quartile, median, lower quartile, and interquartile range;
The utterance duration: the utterance duration characterizes the differences in the temporal structure of speech under different emotions; 6 duration-related parameters are extracted, comprising the total utterance time, the voiced duration, the unvoiced duration, the ratio of voiced to unvoiced time, the ratio of voiced time to total utterance time, and the ratio of unvoiced time to total utterance time.
The formants: the Burg method is used to compute the 14th-order linear prediction coefficients (LPC) of the emotional speech, and a peak-picking method then computes the mean, standard deviation and median of the first, second and third formants F1, F2, F3, together with the median bandwidth of each of these three formants, giving 12 formant-related feature parameters in total;
The band energy distribution: the energy distribution parameters SED of 5 frequency bands are extracted, namely the mean band energy SED_500 of 0-500 Hz, SED_1000 of 500-1000 Hz, SED_2500 of 1000-2500 Hz, SED_4000 of 2500-4000 Hz, and SED_5000 of 4000-5000 Hz;
The harmonics-to-noise ratio: the mean, standard deviation, minimum, maximum and range of the harmonics-to-noise ratio (HNR) are extracted; the HNR is computed as:

$\mathrm{HNR} = 10\log_{10}\!\left[\sum_{i=1}^{N} h(i)^2 \Big/ \sum_{i=1}^{N} n(i)^2\right]$ (formula 1)

where h(i) and n(i) denote the harmonic and noise components, respectively;
The short-term perturbation parameters: these comprise jitter and shimmer, which describe the small cycle-to-cycle variations of the fundamental frequency and of the amplitude respectively, and can be obtained from the slope variations of the pitch contour and the amplitude contour;
The jitter is defined as:

$\mathrm{Jitter}(\%) = \sum_{i=2}^{N-1}\left(2T_i - T_{i-1} - T_{i+1}\right) \Big/ \sum_{i=2}^{N-1} T_i$ (formula 2)

where T_i denotes the i-th pitch period and N is the number of pitch periods;
The shimmer is defined as:

$\mathrm{Shimmer}(\%) = \sum_{i=2}^{N-1}\left(2E_i - E_{i-1} - E_{i+1}\right) \Big/ \sum_{i=2}^{N-1} E_i$ (formula 3)

where E_i denotes the i-th peak-to-peak energy (peak amplitude).
The specific steps for constructing the sparse representation classifier are as follows:
Given the training samples of a certain class, a test sample is regarded as a linear combination of the training samples of the same class,

$y_{k,\mathrm{test}} = \alpha_{k,1} y_{k,1} + \alpha_{k,2} y_{k,2} + \cdots + \alpha_{k,n_k} y_{k,n_k} + \epsilon_k = \sum_{i=1}^{n_k} \alpha_{k,i} y_{k,i} + \epsilon_k$ (formula 1)

where y_k,test denotes a test sample of the k-th class, y_k,i denotes the i-th training sample of the k-th class, α_k,i denotes the weight of the corresponding training sample, and ε_k denotes the error;
Over the training samples of all target classes, (formula 1) can be written as:

$y_{k,\mathrm{test}} = \alpha_{1,1} y_{1,1} + \cdots + \alpha_{k,1} y_{k,1} + \cdots + \alpha_{k,n_k} y_{k,n_k} + \cdots + \alpha_{c,n_c} y_{c,n_c} + \epsilon = \sum_{i=1}^{n_1} \alpha_{1,i} y_{1,i} + \cdots + \sum_{i=1}^{n_k} \alpha_{k,i} y_{k,i} + \cdots + \sum_{i=1}^{n_c} \alpha_{c,i} y_{c,i} + \epsilon$ (formula 2)

where c denotes the total number of classes among all training samples;
Written in matrix form, (formula 2) becomes

$y_{k,\mathrm{test}} = A\alpha + \epsilon$ (formula 3)

where

$A = \left[\, y_{1,1} \,|\, \cdots \,|\, y_{1,n_1} \,|\, \cdots \,|\, y_{k,1} \,|\, \cdots \,|\, y_{k,n_k} \,|\, \cdots \,|\, y_{c,1} \,|\, \cdots \,|\, y_{c,n_c} \right],\quad \alpha = \left[\alpha_{1,1} \cdots \alpha_{1,n_1} \cdots \alpha_{k,1} \cdots \alpha_{k,n_k} \cdots \alpha_{c,1} \cdots \alpha_{c,n_c}\right]^{\mathsf T}$ (formula 4)
In the sparse representation classifier, all elements of the weight vector α except those associated with the k-th class should ideally be zero. To obtain the weight vector α, the following optimization problem in the L0-norm sense needs to be solved:

$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.}\quad \|y_{k,\mathrm{test}} - A\alpha\|_2 \le \epsilon$ (formula 5)

To solve (formula 5), the L0-norm optimization problem is relaxed into an L1-norm optimization problem:

$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.}\quad \|y_{k,\mathrm{test}} - A\alpha\|_2 \le \epsilon$ (formula 6)

This is a convex optimization problem and can be converted into a linear programming problem and solved;
To further improve the noise robustness of the sparse representation, a weighted L1-norm optimization problem is designed, and (formula 6) becomes

$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.}\quad \|W(y_{k,\mathrm{test}} - A\alpha)\|_2 \le \epsilon$ (formula 7)

where the weighting factor W can be expressed as

$W_i = \exp\!\left(-\frac{\|y - y_{\mathrm{recons}}(i)\|^2}{2\sigma^2}\right)$ (formula 8)

In the formula, σ is a constant (set to 1) and y_recons(i) = Aα_i denotes a sample reconstructed from the weight vector α_i. For data with a higher noise level the residual ||y − y_recons(i)||_2 is larger, so the corresponding weighting factor is smaller, which attenuates the influence of noise;
Given a new test sample y_test, the weight vector α is first obtained by solving (formula 7); if the largest of the nonzero coefficients of α corresponds to the k-th class, y_test is assigned to that class, i.e. y_test is assigned to the class corresponding to the largest coefficient in the weight vector α.
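The L1-norm problems of (formula 6) and (formula 7) are standard convex programs. The sketch below, which assumes the cvxpy modeling library and treats the weighting matrix W as a given diagonal matrix (the patent does not spell out its exact construction beyond formula 8), shows one way the sparse weight vector α of a test feature vector could be obtained in practice.

```python
import numpy as np
import cvxpy as cp

def sparse_coefficients(A, y_test, W=None, eps=0.1):
    """Solve (formula 6), or its weighted form (formula 7) when W is given.
    A: (d, n) dictionary whose columns are training feature vectors grouped by
    emotion class; y_test: d-dimensional test feature vector; eps: error tolerance."""
    d, n = A.shape
    if W is None:
        W = np.eye(d)                        # unweighted case reduces to (formula 6)
    alpha = cp.Variable(n)
    residual = W @ (y_test - A @ alpha)
    problem = cp.Problem(cp.Minimize(cp.norm(alpha, 1)),
                         [cp.norm(residual, 2) <= eps])
    problem.solve()                          # convex problem, handled by the default solver
    return alpha.value
```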
The training and testing of the sparse representation classifier comprises the following steps:
(4-1) Using the feature vectors of the training samples, each emotion-class test sample is sparsely represented: given a test sample of a certain emotion class, its weight vector α is obtained by solving the weighted L1-norm optimization problem of (formula 7);
(4-2) For each emotion class i (i = 1, 2, ..., 7), a new sample is first approximately reconstructed for the test sample y_test, denoted y_recons(i) = Aα_i, and the residual between this reconstruction and y_test is computed, i.e. r(y_test, i) = ||y_test − y_recons(i)||_2;
(4-3) The class index i with the minimum residual is taken as the emotion class of the test sample y_test, i.e. identity(y_test) = arg min_i r(y_test, i), and the recognition results for the different emotion classes are output.
Seven kinds of emotional speech samples are chosen from the emotional speech sample library: anger, happiness, sadness, fear, disgust, boredom and neutral (no emotion).
The beneficial effects of the present invention are:
1. It fully considers that emotional speech in natural environments is usually affected by noise, and provides a robust speech emotion recognition method for noisy backgrounds.
2. It fully considers the validity of different types of feature parameters, extending feature extraction from the two aspects of prosodic and voice-quality features to the Mel-frequency cepstral coefficients (MFCC), which further improves the noise robustness of the feature parameters.
3. It exploits the discriminative power of sparse representation in compressive sensing theory and provides a high-performance robust speech emotion recognition method based on that theory.
Description of the drawings
Fig. 1 — Block diagram of the speech emotion recognition system.
Fig. 2 — Statistics of the emotional acoustic feature parameters.
Fig. 3 — Comparison of the speech emotion recognition performance (%) obtained by different methods under different signal-to-noise ratios (SNR).
Fig. 4 — Correct recognition rates (%) of the different emotion classes obtained when the proposed method performs best.
Embodiment
Fig. 1 is the block diagram of the system, which mainly comprises two parts: acoustic feature extraction, and the training and testing of the sparse representation classifier.
One. Acoustic feature extraction
From the Berlin database of German emotional speech (see: Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B. A database of German emotional speech. In: Proceedings of Interspeech 2005, Lisbon, Portugal, 2005, pp. 1-4), seven kinds of emotional speech samples — anger, happiness, sadness, fear, disgust, boredom and neutral (no emotion) — are chosen, 535 utterances in total. White Gaussian noise is added to each chosen sample, which is then pre-processed by pre-emphasis, framing and windowing, with a frame length of 10 ms. The acoustic feature parameters of three aspects are then extracted: prosodic features, voice-quality features and Mel-frequency cepstral coefficients (MFCC). Fig. 2 gives the statistics of the extracted emotional acoustic feature parameters of these three aspects, 204 parameters in total. The extraction of these feature parameters is detailed as follows:
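As a minimal sketch of this preparation step (using numpy and librosa, neither of which is mentioned in the patent; the file name is a hypothetical placeholder), white Gaussian noise at a chosen SNR can be added and the signal pre-emphasized, framed into 10 ms frames and windowed as follows.

```python
import numpy as np
import librosa

def add_white_gaussian_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so the noisy signal has the requested SNR in dB."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

y, sr = librosa.load("berlin_sample.wav", sr=None)           # hypothetical Berlin-database utterance
y_noisy = add_white_gaussian_noise(y, snr_db=10)

y_pre = librosa.effects.preemphasis(y_noisy, coef=0.97)      # pre-emphasis
frame_len = int(0.010 * sr)                                  # 10 ms frames, as in the text
frames = librosa.util.frame(y_pre, frame_length=frame_len,
                            hop_length=frame_len // 2).T     # (n_frames, frame_len)
frames = frames * np.hamming(frame_len)                      # Hamming windowing
```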
1. Prosodic feature extraction: fundamental frequency, amplitude and utterance duration.
(1-1) Fundamental frequency: a correlation method is used to extract the pitch contour of the emotional speech, and 10 statistical parameters of the contour are computed, comprising the maximum, minimum, range, upper quartile, median, lower quartile, interquartile range, mean, standard deviation, and mean absolute slope (a minimal extraction sketch follows this list).
(1-2) Amplitude: obtained by a sum-of-squares method; 9 amplitude-related statistical parameters are extracted, comprising the mean, standard deviation, maximum, minimum, range, upper quartile, median, lower quartile, and interquartile range.
(1-3) Utterance duration: the utterance duration characterizes the differences in the temporal structure of speech under different emotions; 6 duration-related parameters are extracted, comprising the total utterance time, the voiced duration, the unvoiced duration, the ratio of voiced to unvoiced time, the ratio of voiced time to total utterance time, and the ratio of unvoiced time to total utterance time.
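A rough sketch of step (1-1), assuming librosa: its YIN implementation, an autocorrelation-type estimator, stands in for the patent's unspecified correlation method; "interquartile range" is this sketch's reading of the inner-quartile extreme value, and the pitch search limits are illustrative assumptions.

```python
import numpy as np
import librosa

def pitch_statistics(y, sr, fmin=60.0, fmax=400.0):
    """10 pitch statistics for one utterance: max, min, range, upper quartile,
    median, lower quartile, interquartile range, mean, std, mean absolute slope."""
    f0 = librosa.yin(y, fmin=fmin, fmax=fmax, sr=sr)         # frame-level F0 contour (Hz)
    q1, med, q3 = np.percentile(f0, [25, 50, 75])
    return {
        "max": float(f0.max()), "min": float(f0.min()),
        "range": float(f0.max() - f0.min()),
        "upper_quartile": float(q3), "median": float(med), "lower_quartile": float(q1),
        "interquartile_range": float(q3 - q1),
        "mean": float(f0.mean()), "std": float(f0.std()),
        "mean_abs_slope": float(np.abs(np.diff(f0)).mean()),
    }
```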
2. Voice-quality feature extraction: formants, band energy distribution, harmonics-to-noise ratio, and short-term perturbation parameters.
(2-1) Formants: the Burg method is used to compute the 14th-order linear prediction coefficients (LPC) of the emotional speech, and a peak-picking method then computes the mean, standard deviation and median of the first, second and third formants F1, F2, F3, together with the median bandwidth of each of these three formants, 12 formant-related feature parameters in total. The approximation criterion of the Burg method is to minimize the sum of the forward and backward prediction mean-square errors of the lattice filter (see: Erkelens J S, Broersen P M T. Bias propagation in the autocorrelation method of linear prediction. IEEE Transactions on Speech and Audio Processing, 1997, 5(2): 116-119).
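A minimal per-frame sketch of step (2-1), assuming librosa (whose lpc routine implements Burg's method): the 14th-order LPC polynomial is computed and its complex roots are converted into candidate formant frequencies and bandwidths. The 90 Hz floor is an illustrative assumption, and the utterance-level statistics (mean, standard deviation and median of F1-F3, plus median bandwidths) would then be pooled over frames.

```python
import numpy as np
import librosa

def frame_formants(frame, sr, order=14, n_formants=3):
    """Candidate formant (frequency, bandwidth) pairs for one windowed frame."""
    a = librosa.lpc(frame.astype(np.float64), order=order)   # Burg-method LPC coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                        # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2.0 * np.pi)             # pole angle -> frequency (Hz)
    bws = -np.log(np.abs(roots)) * sr / np.pi                # pole radius -> bandwidth (Hz)
    candidates = sorted((f, b) for f, b in zip(freqs, bws) if f > 90.0)
    return candidates[:n_formants]                           # approx. (F1, B1), (F2, B2), (F3, B3)
```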
(2-2) Band energy distribution: the energy distribution parameters SED of 5 frequency bands are extracted, namely the mean band energy SED_500 of 0-500 Hz, SED_1000 of 500-1000 Hz, SED_2500 of 1000-2500 Hz, SED_4000 of 2500-4000 Hz, and SED_5000 of 4000-5000 Hz.
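A sketch of step (2-2), assuming numpy and the windowed frames from the pre-processing sketch above: the per-frame power spectrum is averaged inside each of the five bands to give SED_500 ... SED_5000.

```python
import numpy as np

def band_energy_means(frames, sr, edges=(0, 500, 1000, 2500, 4000, 5000)):
    """Mean spectral energy in the bands 0-500, 500-1000, 1000-2500,
    2500-4000 and 4000-5000 Hz, averaged over frames and frequency bins."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_bins) power spectrum
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    names = ["SED_500", "SED_1000", "SED_2500", "SED_4000", "SED_5000"]
    sed = {}
    for name, lo, hi in zip(names, edges[:-1], edges[1:]):
        band = (freqs >= lo) & (freqs < hi)
        sed[name] = float(spectrum[:, band].mean())
    return sed
```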
(2-3) Harmonics-to-noise ratio: the mean, standard deviation, minimum, maximum and range of the harmonics-to-noise ratio (HNR) are extracted; the HNR is computed as:

$\mathrm{HNR} = 10\log_{10}\!\left[\sum_{i=1}^{N} h(i)^2 \Big/ \sum_{i=1}^{N} n(i)^2\right]$ (formula 1)

where h(i) and n(i) denote the harmonic and noise components, respectively.
(2-4) Short-term perturbation parameters: these comprise jitter and shimmer, which describe the small cycle-to-cycle variations of the fundamental frequency and of the amplitude respectively, and can be obtained from the slope variations of the pitch contour and the amplitude contour.
The jitter is defined as:

$\mathrm{Jitter}(\%) = \sum_{i=2}^{N-1}\left(2T_i - T_{i-1} - T_{i+1}\right) \Big/ \sum_{i=2}^{N-1} T_i$ (formula 2)

where T_i denotes the i-th pitch period and N is the number of pitch periods.
The shimmer is defined as:

$\mathrm{Shimmer}(\%) = \sum_{i=2}^{N-1}\left(2E_i - E_{i-1} - E_{i+1}\right) \Big/ \sum_{i=2}^{N-1} E_i$ (formula 3)

where E_i denotes the i-th peak-to-peak energy (peak amplitude).
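A direct numpy transcription of (formula 2) and (formula 3), assuming the sequences of pitch periods T_i and peak amplitudes E_i have already been measured from the pitch and amplitude contours; note that many practical jitter/shimmer definitions take the absolute value of the second difference, which the formulas as written do not.

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Jitter (formula 2) and shimmer (formula 3) in percent, from the pitch
    periods T_i and peak amplitudes E_i of one utterance (both of length N >= 3)."""
    T = np.asarray(periods, dtype=float)
    E = np.asarray(amplitudes, dtype=float)
    jitter = 100.0 * np.sum(2 * T[1:-1] - T[:-2] - T[2:]) / np.sum(T[1:-1])
    shimmer = 100.0 * np.sum(2 * E[1:-1] - E[:-2] - E[2:]) / np.sum(E[1:-1])
    return jitter, shimmer
```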
3. Mel-frequency cepstral coefficients (MFCC): the 13-dimensional MFCC features together with their first- and second-order derivatives are extracted, and their means and standard deviations are then computed.
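A minimal sketch of the MFCC branch, assuming librosa (the patent does not name an implementation): 13 MFCCs and their first- and second-order deltas are computed per frame, and the per-utterance mean and standard deviation of each coefficient are concatenated into one summary vector.

```python
import numpy as np
import librosa

def mfcc_statistics(y, sr):
    """13 MFCCs + delta + delta-delta, summarized by their per-utterance mean and std."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, n_frames)
    delta1 = librosa.feature.delta(mfcc)                       # first-order derivative
    delta2 = librosa.feature.delta(mfcc, order=2)              # second-order derivative
    feats = np.vstack([mfcc, delta1, delta2])                  # (39, n_frames)
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # 78-dim summary vector
```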
Two. Training and testing of the sparse representation classifier
The training and testing of the sparse representation classifier comprises the following steps:
1. Using the feature vectors of the training samples, each emotion-class test sample is sparsely represented: given a test sample of a certain emotion class, its weight vector α is obtained by solving the weighted L1-norm optimization problem of (formula 7).
2. For each emotion class i (i = 1, 2, ..., 7), a new sample is first approximately reconstructed for the test sample y_test, denoted y_recons(i) = Aα_i, and the residual between this reconstruction and y_test is computed, i.e. r(y_test, i) = ||y_test − y_recons(i)||_2.
3. The class index i with the minimum residual is taken as the emotion class of the test sample y_test, i.e. identity(y_test) = arg min_i r(y_test, i), and the recognition results for the different emotion classes are output.
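A sketch of steps 2 and 3 (the per-class reconstruction and the arg-min residual decision), assuming numpy; the coefficient vector alpha would come from an L1 solve such as the one sketched in the classifier-construction section, and labels gives the emotion class of each dictionary column.

```python
import numpy as np

def classify_by_residual(A, labels, y_test, alpha):
    """Assign y_test to the class whose coefficients reconstruct it with minimum residual."""
    labels = np.asarray(labels)
    residuals = {}
    for cls in np.unique(labels):
        alpha_cls = np.where(labels == cls, alpha, 0.0)              # keep only this class's coefficients
        y_recons = A @ alpha_cls                                     # y_recons(i) = A * alpha_i
        residuals[cls] = float(np.linalg.norm(y_test - y_recons))    # r(y_test, i)
    predicted = min(residuals, key=residuals.get)                    # arg min_i r(y_test, i)
    return predicted, residuals
```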
Three. Evaluation of the recognition system
To improve the reliability of the test results, a 10-fold cross-validation scheme is adopted in the emotion recognition tests.
Fig. 3 compares the speech emotion recognition performance (%) obtained by the proposed method with that of four other recognition methods — the linear discriminant classifier (LDC), k-nearest neighbors (KNN), artificial neural networks (ANN) and support vector machines (SVM) — under different signal-to-noise ratios (SNR). The SNR values start from the noise-free case (acoustic feature data extracted directly from the emotional utterances of the Berlin database) and then decrease from 30 dB in steps of 5 dB down to −10 dB. The results show that the speech emotion recognition performance obtained by the proposed method is clearly better than that of the other four methods under all SNR conditions, i.e. the proposed method achieves excellent robust speech emotion recognition performance. In addition, the proposed method also obtains the best recognition performance in the noise-free case. Fig. 4 gives the correct recognition rates (%) of the different emotion classes when the proposed method performs best, i.e. in the noise-free case; the bold diagonal entries in Fig. 4 are the correct recognition rates of the individual emotion classes.
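A sketch of the evaluation protocol, assuming scikit-learn and a classify_fn that maps (dictionary, labels, test vector) to a predicted label (both are illustrative stand-ins; the patent only specifies the 10-fold split and the averaging of the 10 runs).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_accuracy(X, y, classify_fn):
    """Mean accuracy over a 10-fold split: 9/10 of the utterances form the
    dictionary, the remaining 1/10 is tested, repeated for each fold."""
    X, y = np.asarray(X), np.asarray(y)
    accuracies = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                               random_state=0).split(X, y):
        A, labels = X[train_idx].T, y[train_idx]               # dictionary columns = training feature vectors
        preds = [classify_fn(A, labels, x) for x in X[test_idx]]
        accuracies.append(np.mean(np.asarray(preds) == y[test_idx]))
    return float(np.mean(accuracies))
```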

Claims (5)

1. A robust speech emotion recognition method based on compressed sensing, characterized in that the method comprises the following steps:
generating noisy emotional speech samples, establishing an acoustic feature extraction module, constructing a sparse representation classifier model, and outputting the speech emotion recognition result;
(1) generating the noisy emotional speech samples, comprising:
dividing all speech samples of the emotional speech sample library into a training set and a test set, and then adding white Gaussian noise to each training and test sample to produce the noisy emotional speech samples;
(2) establishing the acoustic feature extraction module, comprising:
performing acoustic feature extraction on the noisy emotional speech samples, the module comprising three parts: prosodic feature extraction, voice-quality feature extraction, and Mel-frequency cepstral coefficient (MFCC) extraction;
(2-1) prosodic feature extraction, comprising: fundamental frequency, amplitude and utterance duration;
(2-2) voice-quality feature extraction, comprising: formants, band energy distribution, harmonics-to-noise ratio and short-term perturbation parameters;
(2-3) MFCC extraction, comprising: extracting the 13-dimensional MFCC features together with their first- and second-order derivatives, and then computing their mean and standard deviation;
(3) constructing the sparse representation classifier model, comprising:
obtaining, through the acoustic feature extraction module, one feature vector per emotional speech sample, formed from the extracted acoustic feature parameters; the feature vectors of all emotional speech samples are input to the sparse representation classifier to build the classifier model;
the sparse representation classifier is built as follows: using the method of sparse decomposition, the test sample is sparsely represented over the training samples, i.e. the training samples are regarded as a set of bases, the sparse representation coefficients of the test sample are obtained by solving an L1-norm minimization problem, and classification is finally performed according to the residual between the test sample and its sparse reconstruction;
the specific steps for constructing the sparse representation classifier are as follows:
given the training samples of a certain class, a test sample is regarded as a linear combination of the training samples of the same class,

$y_{k,\mathrm{test}} = \alpha_{k,1} y_{k,1} + \alpha_{k,2} y_{k,2} + \cdots + \alpha_{k,n_k} y_{k,n_k} + \epsilon_k = \sum_{i=1}^{n_k} \alpha_{k,i} y_{k,i} + \epsilon_k$ (formula 1)

where y_k,test denotes a test sample of the k-th class, y_k,i denotes the i-th training sample of the k-th class, α_k,i denotes the weight of the corresponding training sample, and ε_k denotes the error;
over the training samples of all target classes, (formula 1) can be written as:

$y_{k,\mathrm{test}} = \alpha_{1,1} y_{1,1} + \cdots + \alpha_{k,1} y_{k,1} + \cdots + \alpha_{k,n_k} y_{k,n_k} + \cdots + \alpha_{c,n_c} y_{c,n_c} + \epsilon = \sum_{i=1}^{n_1} \alpha_{1,i} y_{1,i} + \cdots + \sum_{i=1}^{n_k} \alpha_{k,i} y_{k,i} + \cdots + \sum_{i=1}^{n_c} \alpha_{c,i} y_{c,i} + \epsilon$ (formula 2)

where c denotes the total number of classes among all training samples;
written in matrix form, (formula 2) becomes

$y_{k,\mathrm{test}} = A\alpha + \epsilon$ (formula 3)

where

$A = \left[\, y_{1,1} \,|\, \cdots \,|\, y_{1,n_1} \,|\, \cdots \,|\, y_{k,1} \,|\, \cdots \,|\, y_{k,n_k} \,|\, \cdots \,|\, y_{c,1} \,|\, \cdots \,|\, y_{c,n_c} \right],\quad \alpha = \left[\alpha_{1,1} \cdots \alpha_{1,n_1} \cdots \alpha_{k,1} \cdots \alpha_{k,n_k} \cdots \alpha_{c,1} \cdots \alpha_{c,n_c}\right]^{\mathsf T}$ (formula 4)

in theory, in the sparse representation classifier all elements of the weight vector α except those associated with the k-th class should be zero; to obtain the weight vector α, the following optimization problem in the L0-norm sense needs to be solved:

$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.}\quad \|y_{k,\mathrm{test}} - A\alpha\|_2 \le \epsilon$ (formula 5)

to solve (formula 5), the L0-norm optimization problem is relaxed into an L1-norm optimization problem:

$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.}\quad \|y_{k,\mathrm{test}} - A\alpha\|_2 \le \epsilon$ (formula 6)

this is a convex optimization problem and can be converted into a linear programming problem and solved;
to further improve the noise robustness of the sparse representation, a weighted L1-norm optimization problem is designed, and (formula 6) becomes:

$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.}\quad \|W(y_{k,\mathrm{test}} - A\alpha)\|_2 \le \epsilon$ (formula 7)

where the weighting factor W can be expressed as:

$W_i = \exp\!\left(-\frac{\|y - y_{\mathrm{recons}}(i)\|^2}{2\sigma^2}\right)$ (formula 8)

in the formula, σ is a constant (set to 1) and y_recons(i) = Aα_i denotes a sample reconstructed from the weight vector α_i; for data with a higher noise level the residual ||y − y_recons(i)||_2 is larger, so the corresponding weighting factor is smaller;
given a new test sample y_test, the weight vector α is first obtained by solving (formula 7); if the largest of the nonzero coefficients of α corresponds to the k-th class, y_test is assigned to that class, i.e. y_test is assigned to the class corresponding to the largest coefficient in the weight vector α;
(4) outputting the speech emotion recognition result, comprising:
training and testing the sparse representation classifier and outputting the speech emotion recognition result; in the emotion recognition tests a 10-fold cross-validation scheme is adopted: all utterances are divided equally into 10 parts, 9 parts are used for training and the remaining part for testing, the experiment is repeated 10 times accordingly, and the mean of the 10 runs is taken as the recognition result.
2. The robust speech emotion recognition method based on compressed sensing according to claim 1, characterized in that:
the fundamental frequency is obtained by using a correlation method to extract the pitch contour of the emotional speech and then computing 10 statistical parameters of the contour, comprising the maximum, minimum, range, upper quartile, median, lower quartile, interquartile range, mean, standard deviation, and mean absolute slope;
the amplitude is obtained by a sum-of-squares method, and 9 amplitude-related statistical parameters are extracted, comprising the mean, standard deviation, maximum, minimum, range, upper quartile, median, lower quartile, and interquartile range;
the utterance duration characterizes the differences in the temporal structure of speech under different emotions, and 6 duration-related parameters are extracted, comprising the total utterance time, the voiced duration, the unvoiced duration, the ratio of voiced to unvoiced time, the ratio of voiced time to total utterance time, and the ratio of unvoiced time to total utterance time.
3. The robust speech emotion recognition method based on compressed sensing according to claim 1, characterized in that:
the formants: the Burg method is used to compute the 14th-order linear prediction coefficients (LPC) of the emotional speech, and a peak-picking method then computes the mean, standard deviation and median of the first, second and third formants F1, F2, F3, together with the median bandwidth of each of these three formants, 12 formant-related feature parameters in total;
the band energy distribution: the energy distribution parameters SED of 5 frequency bands are extracted, namely the mean band energy SED_500 of 0-500 Hz, SED_1000 of 500-1000 Hz, SED_2500 of 1000-2500 Hz, SED_4000 of 2500-4000 Hz, and SED_5000 of 4000-5000 Hz;
the harmonics-to-noise ratio: the mean, standard deviation, minimum, maximum and range of the harmonics-to-noise ratio (HNR) are extracted, computed as:

$\mathrm{HNR} = 10\log_{10}\!\left[\sum_{i=1}^{N} h(i)^2 \Big/ \sum_{i=1}^{N} n(i)^2\right]$ (formula 1)

the short-term perturbation parameters: these comprise jitter and shimmer, which describe the small cycle-to-cycle variations of the fundamental frequency and of the amplitude respectively, and can be obtained from the slope variations of the pitch contour and the amplitude contour;
the jitter is defined as:

$\mathrm{Jitter}(\%) = \sum_{i=2}^{N-1}\left(2T_i - T_{i-1} - T_{i+1}\right) \Big/ \sum_{i=2}^{N-1} T_i$ (formula 2)

where T_i denotes the i-th pitch period and N is the number of pitch periods;
the shimmer is defined as:

$\mathrm{Shimmer}(\%) = \sum_{i=2}^{N-1}\left(2E_i - E_{i-1} - E_{i+1}\right) \Big/ \sum_{i=2}^{N-1} E_i$ (formula 3)

where E_i denotes the i-th peak-to-peak energy (peak amplitude).
4. The robust speech emotion recognition method based on compressed sensing according to claim 1, characterized in that:
the training and testing of the sparse representation classifier comprises the following steps:
(4-1) using the feature vectors of the training samples, each emotion-class test sample is sparsely represented: given a test sample of a certain emotion class, its weight vector α is obtained by solving the weighted L1-norm optimization problem of (formula 7);
(4-2) for each emotion class i (i = 1, 2, ..., 7), a new sample is first approximately reconstructed for the test sample y_test, denoted y_recons(i) = Aα_i, and the residual between this reconstruction and y_test is computed, i.e. r(y_test, i) = ||y_test − y_recons(i)||_2;
(4-3) the class index i with the minimum residual is taken as the emotion class of the test sample y_test, i.e. identity(y_test) = arg min_i r(y_test, i), and the recognition results for the different emotion classes are output.
5. The robust speech emotion recognition method based on compressed sensing according to any one of claims 1-4, characterized in that seven kinds of emotional speech samples are chosen from the emotional speech sample library: anger, happiness, sadness, fear, disgust, boredom and neutral (no emotion).

Legal events

C06 / PB01 — Publication (application CN103021406A published 2013-04-03)
C10 / SE01 — Entry into substantive examination
C14 / GR01 — Grant of patent (granted publication date: 2014-10-22)
CF01 — Termination of patent right due to non-payment of annual fee (termination date: 2016-12-18)