CN103021406B - Robust speech emotion recognition method based on compressive sensing - Google Patents
Abstract
The invention discloses a robust speech emotion recognition method based on compressive sensing. The method comprises: generating noisy emotional speech samples, establishing an acoustic feature extraction module, constructing a sparse representation classifier model, and outputting the speech emotion recognition result. The method has the following advantages: the effect of noise on emotional speech in natural environments is fully considered, and a robust speech emotion recognition method for noisy backgrounds is provided; the validity of different types of feature parameters is fully considered, and feature extraction is extended from prosodic and voice quality features to the Mel frequency cepstral coefficients (MFCC), further improving the noise robustness of the feature parameters; and a high-performance robust speech emotion recognition method based on compressive sensing theory is provided by exploiting the discriminative power of sparse representation in that theory.
Description
Technical field
The present invention relates to the fields of speech processing and pattern recognition, and in particular to a robust speech emotion recognition method based on compressed sensing.
Background art
Human speech carries not only textual and symbolic information but also information such as the speaker's emotion and mood. How to let a computer automatically analyze the speech signal and judge the speaker's affective state, so-called "speech emotion recognition", has become a research focus in fields such as speech processing and pattern recognition. The ultimate purpose of this research is to endow computers with emotional intelligence, so that they can interact as naturally, cordially and vividly as people do. This research has important application value in fields such as artificial intelligence, robotics, and natural human-computer interaction.
At present, research on speech emotion recognition basically takes emotional corpora recorded in quiet environments as the object of analysis. However, emotional speech in natural environments is usually disturbed by noise to varying degrees. Research on robust speech emotion recognition under noisy backgrounds is therefore closer to reality and has greater application value, yet the literature on this topic is still very scarce.
Automatic speech emotion recognition mainly involves two problems. The first is affective feature extraction: which effective speech feature parameters should be extracted for emotion recognition. The second is the recognition method: which effective pattern recognition method should be used to classify the emotion category of a sentence containing a certain emotion (see patent: Zou Cairong, a speech emotion recognition method based on support vector machines, application/patent No. 2006100973016).
In terms of affective feature extraction, the feature parameters commonly used in speech emotion recognition are prosodic features and voice quality features; the former include fundamental frequency, amplitude and pronunciation duration, while the latter include formants, frequency band energy distribution, harmonic-to-noise ratio and short-term perturbation parameters. However, the noise robustness of these feature parameters is very limited, so it is difficult to obtain good speech emotion recognition performance under noisy backgrounds using prosodic and voice quality features alone. To improve noise robustness, it is necessary to extract feature parameters of another type, namely spectral features, and fuse them with the prosodic and voice quality features. A representative spectral feature is the Mel frequency cepstral coefficient (MFCC), which reflects the characteristics of human hearing.
In terms of recognition methods, those that have been successfully applied to speech emotion recognition mainly include the linear discriminant classifier (LDC), k-nearest neighbors (KNN), artificial neural networks (ANN) and support vector machines (SVM). However, these methods are rather sensitive to noise, and it is difficult for them to achieve good robust recognition performance. It is therefore necessary to develop a new, high-performance robust speech emotion recognition method.
Compressed sensing (CS) technology is introduced next.
Compressed sensing (CS) (see: E. J. Candes, M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 2008, 25(2): 21-30) is a brand-new signal processing and sampling theory. Its core idea is that as long as a signal is compressible, or sparse in some transform domain, an observation matrix incoherent with the transform basis can be used to project the high-dimensional signal onto a low-dimensional space, and the original signal can then be reconstructed with high probability from this small number of projections by solving an optimization problem. Under this theoretical framework, the sampling rate is no longer determined by the bandwidth of the signal, but by the structure and content of the information in the signal.
The original intention of compressed sensing research was the compression and representation of signals, but the sparsest representation also has good discriminative power and can be used to build classifiers (see: Guha T, Ward RK. Learning Sparse Representations for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(8): 1576-1588). So far, no work in the speech emotion recognition literature has adopted the discriminative power of sparse representation in compressive sensing theory as a robust recognition method. The present invention uses it to realize robust speech emotion recognition under noisy backgrounds.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the above existing emotion recognition technology and to provide a robust speech emotion recognition method based on compressed sensing, for realizing robust speech emotion recognition under noisy backgrounds.
The technical solution adopted in the present invention is:
A robust speech emotion recognition method based on compressed sensing, comprising the following steps:
generating noisy emotional speech samples, establishing an acoustic feature extraction module, constructing a sparse representation classifier model, and outputting the speech emotion recognition result;
(1) generating the noisy emotional speech samples, comprising:
dividing all speech samples of the emotional speech sample library into two parts, training samples and test samples, and then adding white Gaussian noise to each training and test sample, thereby producing the noisy emotional speech samples;
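The noise-mixing step above can be sketched as follows; the target SNR values and the helper name are illustrative assumptions, since the source only specifies white Gaussian noise:

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Add white Gaussian noise to a clean signal at a target SNR in dB."""
    rng = np.random.default_rng(rng)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

# Example: corrupt a 200 Hz tone at 10 dB SNR.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t)
noisy = add_noise_at_snr(clean, snr_db=10, rng=0)
```

The same routine can be applied to every training and test sample at each SNR level of interest.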
(2) establishing the acoustic feature extraction module, comprising:
performing acoustic feature extraction on the noisy emotional speech samples, the acoustic feature extraction module comprising three parts: prosodic feature extraction, voice quality feature extraction, and Mel frequency cepstral coefficient (MFCC) extraction;
(2-1) prosodic feature extraction, comprising: fundamental frequency, amplitude and pronunciation duration;
(2-2) voice quality feature extraction, comprising: formants, frequency band energy distribution, harmonic-to-noise ratio and short-term perturbation parameters;
(2-3) Mel frequency cepstral coefficient (MFCC) extraction, comprising: extracting the 13-dimensional MFCC features together with their first- and second-order derivatives, then calculating their mean values and standard deviations;
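The MFCC step (13 coefficients plus first- and second-order derivatives, summarized by mean and standard deviation) can be sketched in plain numpy/scipy as below. This is a minimal sketch, not the patent's exact extractor; the frame length, hop size, FFT size and filter count are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_13(signal, fs, frame_len=400, hop=160, n_filt=26, n_ceps=13):
    """Utterance-level mean/std of 13 MFCCs and their deltas (minimal sketch)."""
    # Frame and window the signal
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies -> DCT -> first 13 cepstra
    feat = np.log(power @ fbank.T + 1e-10)
    ceps = dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # First- and second-order derivatives along the time axis
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    full = np.hstack([ceps, d1, d2])                      # (n_frames, 39)
    # Utterance-level statistics: mean and standard deviation
    return np.hstack([full.mean(axis=0), full.std(axis=0)])  # (78,)

feat = mfcc_13(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000), 16000)
```

Each utterance is thus reduced to a fixed-length statistics vector, matching the mean/standard-deviation summary described above.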
(3) constructing the sparse representation classifier model, comprising:
through the acoustic feature extraction module, each emotional speech sample corresponds to an eigenvector formed by the extracted acoustic feature parameters; the eigenvectors of all emotional speech samples are input into the sparse representation classifier to build the sparse representation classifier model;
the sparse representation classifier is built as follows: first, using the method of sparse decomposition, the test sample is sparsely represented over the training samples, the training samples being regarded as a set of bases; the sparse representation coefficients of the test sample are obtained by solving an L1-norm minimization problem; finally, classification is performed according to the residual between the test sample and its sparse representation;
(4) outputting the speech emotion recognition result, comprising:
outputting the speech emotion recognition result through the training and testing of the sparse representation classifier; in the emotion recognition test, 10-fold cross-validation is adopted: all sentences are divided equally into 10 parts, 9 of which are used for training and the remaining 1 for testing each time; the recognition experiment is repeated 10 times accordingly, and the mean of the 10 runs is taken as the recognition result.
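The 10-fold cross-validation protocol above can be sketched as follows; `train_and_test` is a hypothetical callback standing in for any classifier that returns an accuracy on the held-out fold:

```python
import numpy as np

def ten_fold_cv(features, labels, train_and_test, rng=0):
    """10-fold cross-validation: 9 parts train, 1 part test, average the result."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(labels))
    folds = np.array_split(order, 10)
    results = []
    for k in range(10):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(10) if j != k])
        results.append(train_and_test(features[train_idx], labels[train_idx],
                                      features[test_idx], labels[test_idx]))
    return float(np.mean(results))
```

With 10 folds, each sample is used for testing exactly once, and the reported figure is the mean over the 10 runs, as described above.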
For the fundamental frequency, the correlation method is used to extract the pitch contour of the emotional speech, and 10 statistical parameters of this pitch curve are then calculated: maximum, minimum, variation range, upper quartile, median, lower quartile, interquartile range, mean, standard deviation, and mean absolute slope;
The amplitude is obtained by the squared-sum method, and 9 amplitude-related statistical parameters are extracted: mean, standard deviation, maximum, minimum, variation range, upper quartile, median, lower quartile, and interquartile range;
The pronunciation duration characterizes the differences in the temporal structure of speech under different emotions; 6 duration-related parameters are extracted: total pronunciation time, voiced duration, unvoiced duration, the ratio of voiced to unvoiced time, the ratio of voiced time to total time, and the ratio of unvoiced time to total time.
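As a sketch of the pitch features above, a frame-level F0 estimate by the autocorrelation ("correlation") method and the 10 contour statistics might look like this; the search band 60-400 Hz is an illustrative assumption:

```python
import numpy as np

def f0_autocorr(frame, fs, fmin=60, fmax=400):
    """Estimate F0 of one frame by the autocorrelation method."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])   # lag of the strongest periodicity
    return fs / lag

def pitch_statistics(f0_track):
    """The 10 statistics of a pitch contour listed above."""
    q1, med, q3 = np.percentile(f0_track, [25, 50, 75])
    return {
        'max': np.max(f0_track), 'min': np.min(f0_track),
        'range': np.ptp(f0_track), 'upper_quartile': q3,
        'median': med, 'lower_quartile': q1,
        'interquartile_range': q3 - q1,
        'mean': np.mean(f0_track), 'std': np.std(f0_track),
        'mean_abs_slope': np.mean(np.abs(np.diff(f0_track))),
    }
```

Running the estimator over voiced frames yields the contour from which the statistics vector is computed.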
For the formants, the Burg method is used to calculate the 14th-order linear prediction coefficients (LPC) of the emotional speech; the peak-detection method is then used to calculate the mean, standard deviation and median of the first, second and third formants F1, F2, F3, together with the bandwidths occupied by the medians of these three formants, extracting 12 formant-related feature parameters in total;
The frequency band energy distribution consists of the energy distribution parameters SED of 5 different frequency bands, namely the mean band energy SED_500 of 0-500 Hz, SED_1000 of 500-1000 Hz, SED_2500 of 1000-2500 Hz, SED_4000 of 2500-4000 Hz, and SED_5000 of 4000-5000 Hz;
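The five band energy parameters can be sketched directly from the magnitude spectrum; taking the mean of the squared FFT magnitudes per band is an assumed reading of "mean band energy":

```python
import numpy as np

BANDS = [(0, 500), (500, 1000), (1000, 2500), (2500, 4000), (4000, 5000)]

def band_energy_distribution(signal, fs):
    """Mean spectral energy in the five bands SED_500 ... SED_5000."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    out = []
    for lo, hi in BANDS:
        mask = (freqs >= lo) & (freqs < hi)
        out.append(spec[mask].mean() if mask.any() else 0.0)
    return np.array(out)

# Example: an 800 Hz tone concentrates its energy in the 500-1000 Hz band.
fs = 16000
sed = band_energy_distribution(np.sin(2 * np.pi * 800 * np.arange(fs) / fs), fs)
```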
For the harmonic-to-noise ratio, the mean, standard deviation, minimum, maximum and variation range of the HNR are extracted;
The short-term perturbation parameters comprise the pitch perturbation (Jitter) and the amplitude perturbation (Shimmer), which represent the subtle variations of fundamental frequency and amplitude respectively and can be obtained by calculating the slope variations of the pitch curve and the amplitude curve;
The computing formula of Jitter is defined as:
Jitter = (1/(N-1)) * Σ_{i=1}^{N-1} |T_i - T_{i+1}|
where T_i denotes the i-th peak-to-peak period and N is the number of periods;
The computing formula of Shimmer is defined as:
Shimmer = (1/(N-1)) * Σ_{i=1}^{N-1} |E_i - E_{i+1}|
where E_i denotes the i-th peak-to-peak energy.
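Given the sequences of peak-to-peak periods T_i and energies E_i defined above, the two perturbation measures can be sketched as mean absolute differences of consecutive values; this common definition is an assumption, since the patent's formula images are not reproduced in the text:

```python
import numpy as np

def jitter(periods):
    """Mean absolute difference between consecutive pitch periods T_i."""
    T = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(T)))

def shimmer(energies):
    """Mean absolute difference between consecutive peak-to-peak energies E_i."""
    E = np.asarray(energies, dtype=float)
    return np.mean(np.abs(np.diff(E)))
```

A perfectly periodic signal yields zero jitter and zero shimmer; perturbed periods or amplitudes raise both values.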
The sparse representation classifier is constructed by the following concrete steps:
Given the training samples of a certain class, a test sample is regarded as a linear combination of the training samples of the same class:
y_{k,test} = Σ_i α_{k,i} y_{k,i} + ε_k (formula 1)
where y_{k,test} denotes a test sample of the k-th class, y_{k,i} denotes the i-th training sample of the k-th class, α_{k,i} denotes the weight of the corresponding training sample, and ε_k denotes the error;
Over the training samples of all target classes, (formula 1) can be expressed as:
y_{k,test} = Σ_{k=1}^{c} Σ_i α_{k,i} y_{k,i} + ε (formula 2)
where c denotes the total number of classes of the training samples;
In matrix form, (formula 2) becomes:
y_{k,test} = Aα + ε (formula 3)
where
A = [y_{1,1}, y_{1,2}, ..., y_{c,n_c}], α = [α_{1,1}, α_{1,2}, ..., α_{c,n_c}]^T (formula 4)
In the sparse representation classifier, all elements of the weight vector α except those associated with the k-th class should ideally be zero; to obtain the weight vector α, the following optimization problem in the L0-norm sense must be solved:
min_α ||α||_0 subject to ||y_test - Aα||_2 ≤ ε (formula 5)
To solve (formula 5), the L0-norm optimization problem is converted into an L1-norm optimization problem:
min_α ||α||_1 subject to ||y_test - Aα||_2 ≤ ε (formula 6)
This is a convex optimization problem and can be converted into a linear program to solve;
To further improve the noise robustness of the sparse representation, a weighted L1-norm optimization problem is designed, and (formula 6) is expressed as:
min_α ||α||_1 subject to ||W(y_test - Aα)||_2 ≤ ε (formula 7)
where the weight factor variable W can be expressed as:
W(i) = exp(-||y_test - y_recons(i)||_2 / σ) (formula 8)
In these formulas, σ is a constant, set to 1, and y_recons(i) = Aα_i denotes a sample reconstructed from the weight vector α_i; for data with a larger noise component, the residual ||y_test - y_recons(i)||_2 is larger and the corresponding weight factor is smaller, so the influence of the noise is reduced;
Given a new test sample y_test, the weight vector α is first obtained by solving (formula 7); if the largest among the nonzero coefficients of α corresponds to the k-th class, y_test is assigned to that class; that is, y_test is assigned to the class whose coefficient in α is largest.
The training and testing of the sparse representation classifier comprise the following steps:
(4-1) The test samples of each emotion class are sparsely represented with the eigenvectors of the training samples: given a test sample of a certain emotion class, its weight vector α is obtained by solving the L1-norm optimization problem of (formula 7);
(4-2) For each emotion class (i = 1, 2, ..., 7), a new sample y_recons(i) = Aα_i is first reconstructed to approximate the test sample y_test; the residual between this reconstructed sample and y_test is then calculated, i.e. r(y_test, i) = ||y_test - y_recons(i)||_2;
(4-3) The class index i with the minimum residual is taken as the emotion class of the test sample y_test, i.e. identify(y_test) = arg min_i r(y_test, i), and the recognition results of the different emotion classes are output.
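The residual-based decision rule of steps (4-2) and (4-3) can be sketched as follows; here the class-i reconstruction keeps only the coefficients of class i (the usual SRC convention, assumed here), and `labels` marks the class of each dictionary column:

```python
import numpy as np

def classify_by_residual(A, labels, alpha, y_test):
    """Assign y_test to the class i minimizing r(y_test, i) = ||y_test - A @ delta_i(alpha)||_2,
    where delta_i keeps only the coefficients belonging to class i."""
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta = np.where(labels == c, alpha, 0.0)  # keep only class-c coefficients
        residuals.append(np.linalg.norm(y_test - A @ delta))
    return classes[int(np.argmin(residuals))], residuals
```

With seven emotion classes, the function returns the class index achieving the smallest reconstruction residual, i.e. identify(y_test).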
From the emotional speech sample library, seven kinds of emotional speech samples are chosen: anger, happiness, sadness, fear, disgust, boredom, and neutral (no emotion).
The beneficial effects of the present invention are:
1. It fully takes into account that emotional speech in natural environments is usually affected by noise, and provides a robust speech emotion recognition method for noisy backgrounds.
2. It fully takes into account the validity of different types of feature parameters, extending feature extraction from prosodic and voice quality features to the Mel frequency cepstral coefficients (MFCC), further improving the noise robustness of the feature parameters.
3. It exploits the discriminative power of sparse representation in compressive sensing theory, providing a high-performance robust speech emotion recognition method based on that theory.
Brief description of the drawings
Fig. 1---block diagram of the speech emotion recognition system.
Fig. 2---statistics of the emotional acoustic feature parameters.
Fig. 3---comparison of the speech emotion recognition performance (%) obtained by different methods under different signal-to-noise ratios (SNR).
Fig. 4---correct recognition rates (%) of the different emotion types obtained when the method of the invention performs best.
Embodiment
Fig. 1 is the block diagram of the system, which mainly comprises two parts: acoustic feature extraction, and the training and testing of the sparse representation classifier.
One, acoustic feature extraction
From the Berlin database of German emotional speech (see: Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B. A database of German emotional speech. In: Proceedings of Interspeech-2005, Lisbon, Portugal, 2005, pp. 1-4), 535 emotional speech samples of seven kinds are chosen: anger, happiness, sadness, fear, disgust, boredom, and neutral (no emotion). White Gaussian noise is added to each chosen sample, which is then pre-processed by pre-emphasis, framing and windowing, with a frame length of 10 ms. The acoustic feature parameters of three aspects are then extracted: prosodic features, voice quality features, and Mel frequency cepstral coefficients (MFCC). Fig. 2 gives the statistics of the extracted emotional acoustic feature parameters of these three aspects, 204 in total. The extraction of these feature parameters is detailed as follows:
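The pre-processing chain just described (pre-emphasis, 10 ms framing, windowing) can be sketched as below; the pre-emphasis coefficient 0.97, the Hamming window and the half-frame hop are illustrative assumptions not stated in the source:

```python
import numpy as np

def preprocess(signal, fs, alpha=0.97, frame_ms=10):
    """Pre-emphasis, framing into 10 ms frames (half-frame hop), Hamming window."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len // 2
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return emphasized[idx] * np.hamming(frame_len)

frames = preprocess(np.ones(16000), 16000)   # 1 s at 16 kHz -> 10 ms frames
```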
1. Prosodic feature extraction: comprises fundamental frequency, amplitude and pronunciation duration.
(1-1) Fundamental frequency: the correlation method is used to extract the pitch contour of the emotional speech, and 10 statistical parameters of this pitch curve are then calculated: maximum, minimum, variation range, upper quartile, median, lower quartile, interquartile range, mean, standard deviation, and mean absolute slope.
(1-2) Amplitude: obtained by the squared-sum method; 9 amplitude-related statistical parameters are extracted: mean, standard deviation, maximum, minimum, variation range, upper quartile, median, lower quartile, and interquartile range.
(1-3) Pronunciation duration: characterizes the differences in the temporal structure of speech under different emotions; 6 duration-related parameters are extracted: total pronunciation time, voiced duration, unvoiced duration, the ratio of voiced to unvoiced time, the ratio of voiced time to total time, and the ratio of unvoiced time to total time.
2. Voice quality feature extraction: comprises formants, frequency band energy distribution, harmonic-to-noise ratio, and short-term perturbation parameters.
(2-1) Formants: the Burg method is used to calculate the 14th-order linear prediction coefficients (LPC) of the emotional speech; the peak-detection method is then used to calculate the mean, standard deviation and median of the first, second and third formants F1, F2, F3, together with the bandwidths occupied by the medians of these three formants, extracting 12 formant-related feature parameters in total. The approximation criterion of the Burg method is to minimize the sum of the forward and backward prediction squared errors of the lattice filter (see: Erkelens JS, Broersen PMT. Bias propagation in the autocorrelation method of linear prediction. IEEE Transactions on Speech and Audio Processing, 1997, 5(2): 116-119).
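A minimal Burg recursion and a root-based formant estimate can be sketched as follows. Picking formants from the angles of the LPC polynomial roots is one standard approach and an assumption here; the patent's peak-detection variant may differ in detail:

```python
import numpy as np

def lpc_burg(x, order):
    """Linear prediction coefficients via Burg's method (a[0] = 1)."""
    f = x[1:].astype(float).copy()    # forward prediction errors
    b = x[:-1].astype(float).copy()   # backward prediction errors
    a = np.array([1.0])
    for _ in range(order):
        # Reflection coefficient minimizing forward + backward squared error
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        a = np.append(a, 0.0) + k * np.append(a, 0.0)[::-1]
        f, b = f[1:] + k * b[1:], b[:-1] + k * f[:-1]
    return a

def formants(frame, fs, order=14):
    """Candidate formant frequencies from the roots of the LPC polynomial."""
    a = lpc_burg(frame, order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # one of each conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 90]                   # drop near-DC roots

# Example: a 1 kHz tone (with a little noise) yields a pole near 1000 Hz.
rng = np.random.default_rng(1)
fs = 8000
x = np.sin(2 * np.pi * 1000 * np.arange(512) / fs) + 0.01 * rng.normal(size=512)
fr = formants(x, fs)
```

F1, F2 and F3 would be taken as the three lowest plausible candidates per frame, and the statistics in the text computed over frames.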
(2-2) Frequency band energy distribution: the energy distribution parameters SED of 5 different frequency bands are extracted, namely the mean band energy SED_500 of 0-500 Hz, SED_1000 of 500-1000 Hz, SED_2500 of 1000-2500 Hz, SED_4000 of 2500-4000 Hz, and SED_5000 of 4000-5000 Hz.
(2-3) Harmonic-to-noise ratio: the mean, standard deviation, minimum, maximum and variation range of the HNR are extracted.
(2-4) Short-term perturbation parameters: comprise the pitch perturbation (Jitter) and the amplitude perturbation (Shimmer), which represent the subtle variations of fundamental frequency and amplitude respectively and can be obtained by calculating the slope variations of the pitch and amplitude curves.
The computing formula of Jitter is defined as:
Jitter = (1/(N-1)) * Σ_{i=1}^{N-1} |T_i - T_{i+1}|
where T_i denotes the i-th peak-to-peak period and N is the number of periods.
The computing formula of Shimmer is defined as:
Shimmer = (1/(N-1)) * Σ_{i=1}^{N-1} |E_i - E_{i+1}|
where E_i denotes the i-th peak-to-peak energy.
3. Mel frequency cepstral coefficients (MFCC): the 13-dimensional MFCC features and their first- and second-order derivatives are extracted, and their mean values and standard deviations are then calculated.
Two, training and testing of the sparse representation classifier
The training and testing of the sparse representation classifier comprise the following steps:
1. The test samples of each emotion class are sparsely represented with the eigenvectors of the training samples: given a test sample of a certain emotion class, its weight vector α is obtained by solving the L1-norm optimization problem of (formula 7).
2. For each emotion class (i = 1, 2, ..., 7), a new sample y_recons(i) = Aα_i is first reconstructed to approximate the test sample y_test; the residual between this reconstructed sample and y_test is then calculated, i.e. r(y_test, i) = ||y_test - y_recons(i)||_2.
3. The class index i with the minimum residual is taken as the emotion class of the test sample y_test, i.e. identify(y_test) = arg min_i r(y_test, i), and the recognition results of the different emotion classes are output.
Three, evaluation of the recognition system
To improve the confidence of the test results, 10-fold cross-validation is adopted in the emotion recognition tests.
Fig. 3 compares the speech emotion recognition performance (%) obtained by the method of the invention and by four other recognition methods, namely the linear discriminant classifier (LDC), k-nearest neighbors (KNN), artificial neural networks (ANN) and support vector machines (SVM), under different signal-to-noise ratios (SNR). The SNR values start from the noise-free case (acoustic feature data extracted directly from the emotional sentences of the Berlin database) and then decrease from 30 dB in steps of 5 dB down to -10 dB. The results show that the recognition performance obtained with the method of the invention is clearly better than that of the other four methods under all SNR conditions; the method therefore achieves excellent robust speech emotion recognition performance. In addition, the method also obtains the best recognition performance in the noise-free case. Fig. 4 gives the correct recognition rates (%) of the different emotion types obtained when the method of the invention performs best, i.e. in the noise-free case; the bold data on the diagonal of Fig. 4 represent the correct recognition rate of each emotion type.
Claims (5)
1. A robust speech emotion recognition method based on compressed sensing, characterized in that the method comprises the following steps:
generating noisy emotional speech samples, establishing an acoustic feature extraction module, constructing a sparse representation classifier model, and outputting the speech emotion recognition result;
(1) generating the noisy emotional speech samples, comprising:
dividing all speech samples of the emotional speech sample library into two parts, training samples and test samples, and then adding white Gaussian noise to each training and test sample, thereby producing the noisy emotional speech samples;
(2) establishing the acoustic feature extraction module, comprising:
performing acoustic feature extraction on the noisy emotional speech samples, the acoustic feature extraction module comprising three parts: prosodic feature extraction, voice quality feature extraction, and Mel frequency cepstral coefficient (MFCC) extraction;
(2-1) prosodic feature extraction, comprising: fundamental frequency, amplitude and pronunciation duration;
(2-2) voice quality feature extraction, comprising: formants, frequency band energy distribution, harmonic-to-noise ratio and short-term perturbation parameters;
(2-3) Mel frequency cepstral coefficient (MFCC) extraction, comprising: extracting the 13-dimensional MFCC features together with their first- and second-order derivatives, then calculating their mean values and standard deviations;
(3) constructing the sparse representation classifier model, comprising:
through the acoustic feature extraction module, each emotional speech sample corresponds to an eigenvector formed by the extracted acoustic feature parameters; the eigenvectors of all emotional speech samples are input into the sparse representation classifier to build the sparse representation classifier model;
the sparse representation classifier is built as follows: first, using the method of sparse decomposition, the test sample is sparsely represented over the training samples, the training samples being regarded as a set of bases; the sparse representation coefficients of the test sample are obtained by solving an L1-norm minimization problem; finally, classification is performed according to the residual between the test sample and its sparse representation;
the sparse representation classifier is constructed by the following concrete steps:
given the training samples of a certain class, a test sample is regarded as a linear combination of the training samples of the same class:
y_{k,test} = Σ_i α_{k,i} y_{k,i} + ε_k (formula 1)
where y_{k,test} denotes a test sample of the k-th class, y_{k,i} denotes the i-th training sample of the k-th class, α_{k,i} denotes the weight of the corresponding training sample, and ε_k denotes the error;
over the training samples of all target classes, (formula 1) can be expressed as:
y_{k,test} = Σ_{k=1}^{c} Σ_i α_{k,i} y_{k,i} + ε (formula 2)
where c denotes the total number of classes of the training samples;
in matrix form, (formula 2) becomes:
y_{k,test} = Aα + ε (formula 3)
where
A = [y_{1,1}, y_{1,2}, ..., y_{c,n_c}], α = [α_{1,1}, α_{1,2}, ..., α_{c,n_c}]^T (formula 4)
in theory, in the sparse representation classifier all elements of the weight vector α except those associated with the k-th class should be zero; to obtain the weight vector α, the following optimization problem in the L0-norm sense must be solved:
min_α ||α||_0 subject to ||y_test - Aα||_2 ≤ ε (formula 5)
to solve (formula 5), the L0-norm optimization problem is converted into an L1-norm optimization problem:
min_α ||α||_1 subject to ||y_test - Aα||_2 ≤ ε (formula 6)
this is a convex optimization problem and can be converted into a linear program to solve;
to further improve the noise robustness of the sparse representation, a weighted L1-norm optimization problem is designed, and (formula 6) is expressed as:
min_α ||α||_1 subject to ||W(y_test - Aα)||_2 ≤ ε (formula 7)
where the weight factor variable W can be expressed as:
W(i) = exp(-||y - y_recons(i)||_2 / σ) (formula 8)
in which σ is a constant, set to 1, and y_recons(i) = Aα_i denotes a sample reconstructed from the weight vector α_i; for data with a larger noise component, the residual ||y - y_recons(i)||_2 is larger and the corresponding weight factor is smaller;
given a new test sample y_test, the weight vector α is first obtained by solving (formula 7); if the largest among the nonzero coefficients of α corresponds to the k-th class, y_test is assigned to that class, i.e. y_test is assigned to the class whose coefficient in α is largest;
(4) outputting the speech emotion recognition result, comprising:
outputting the speech emotion recognition result through the training and testing of the sparse representation classifier; in the emotion recognition test, 10-fold cross-validation is adopted: all sentences are divided equally into 10 parts, 9 of which are used for training and the remaining 1 for testing each time; the recognition experiment is repeated 10 times accordingly, and the mean of the 10 runs is taken as the recognition result.
2. The robust speech emotion recognition method based on compressed sensing as claimed in claim 1, characterized in that:
for the fundamental frequency, the correlation method is used to extract the pitch contour of the emotional speech, and 10 statistical parameters of this pitch curve are then calculated: maximum, minimum, variation range, upper quartile, median, lower quartile, interquartile range, mean, standard deviation, and mean absolute slope;
the amplitude is obtained by the squared-sum method, and 9 amplitude-related statistical parameters are extracted: mean, standard deviation, maximum, minimum, variation range, upper quartile, median, lower quartile, and interquartile range;
the pronunciation duration characterizes the differences in the temporal structure of speech under different emotions; 6 duration-related parameters are extracted: total pronunciation time, voiced duration, unvoiced duration, the ratio of voiced to unvoiced time, the ratio of voiced time to total time, and the ratio of unvoiced time to total time.
3. the robustness speech emotion identification method based on compressed sensing as claimed in claim 1, is characterized in that,
Described resonance peak: adopt Burger Burg method to calculate 14 rank linear predictor coefficient LPC of emotional speech, with peak value, detect again the shared bandwidth of median of mean value, standard deviation, median and these three resonance peaks that method calculates first, second, third resonance peak F1, F2, F3, extract altogether 12 resonance peak relevant feature parameters;
Described frequency band energy distributes: extract the energy distribution parameter S ED of 5 different frequency bands, i.e. the frequency band energy mean value SED of 0-500Hz
500, 500-1000Hz frequency band energy mean value SED
1000, 1000-2500Hz frequency band energy mean value SED
2500, 2500-4000Hz frequency band energy mean value SED
4000, 4000-5000Hz frequency band energy mean value SED
5000;
The harmonic-to-noise ratio: the mean, standard deviation, minimum, maximum, and range of the harmonic-to-noise ratio HNR are extracted; in standard form it is computed as:
HNR = 10 · log10(E_H / E_N) (formula 1)
where E_H is the energy of the harmonic component and E_N is the energy of the noise component;
The short-term perturbation parameters: these comprise the pitch perturbation (Jitter) and the amplitude perturbation (Shimmer), which represent the cycle-to-cycle variation of the fundamental frequency and of the amplitude respectively, and can be obtained by computing the slope changes of the pitch curve and the amplitude curve;
Jitter is defined as:
Jitter = [ (1/(N-1)) · Σ_{i=1}^{N-1} |T_i − T_{i+1}| ] / [ (1/N) · Σ_{i=1}^{N} T_i ] (formula 2)
where T_i is the i-th pitch period and N is the number of pitch periods;
Shimmer is defined as:
Shimmer = [ (1/(N-1)) · Σ_{i=1}^{N-1} |E_i − E_{i+1}| ] / [ (1/N) · Σ_{i=1}^{N} E_i ] (formula 3)
where E_i is the i-th peak energy.
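A direct sketch of the two perturbation measures, assuming the standard relative Jitter/Shimmer definitions (mean absolute cycle-to-cycle difference normalised by the mean); the function names and the assumption that the period/energy sequences are already extracted are illustrative:

```python
import numpy as np

def jitter(periods):
    """Relative jitter: mean |T_i - T_{i+1}| over the mean pitch period."""
    T = np.asarray(periods, dtype=float)
    return (np.abs(np.diff(T)).sum() / (len(T) - 1)) / T.mean()

def shimmer(energies):
    """Shimmer: same form as jitter, applied to the peak energies E_i."""
    E = np.asarray(energies, dtype=float)
    return (np.abs(np.diff(E)).sum() / (len(E) - 1)) / E.mean()
```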
4. The robust speech emotion recognition method based on compressed sensing as claimed in claim 1, characterized in that:
The training and testing of the sparse representation classifier comprises the following steps:
(4-1) Each emotion-class test sample is sparsely represented with the feature vectors of the training samples: given a test sample of one emotion class, its weight vector α is obtained by solving the L1-norm minimization problem of (formula 7);
(4-2) For the test sample y_test of each emotion class i (i = 1, 2, …, 7), a new sample is first reconstructed by approximation, denoted y_recons(i); the residual between this reconstruction and y_test is then computed, i.e. r(y_test, i) = ||y_test − y_recons(i)||_2;
(4-3) The class index i with the minimum residual is taken as the emotion class of y_test, i.e. identify(y_test) = arg min_i r(y_test, i), and the recognition results for the different emotion classes are output.
5. The robust speech emotion recognition method based on compressed sensing as claimed in any one of claims 1-4, characterized in that seven kinds of emotional speech samples are chosen from the emotional speech sample library: anger, happiness, sadness, fear, disgust, boredom, and neutral (no emotion).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210551585.7A CN103021406B (en) | 2012-12-18 | 2012-12-18 | Robust speech emotion recognition method based on compressive sensing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103021406A CN103021406A (en) | 2013-04-03 |
CN103021406B true CN103021406B (en) | 2014-10-22 |
Family
ID=47969938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210551585.7A Expired - Fee Related CN103021406B (en) | 2012-12-18 | 2012-12-18 | Robust speech emotion recognition method based on compressive sensing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103021406B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103345923B (en) * | 2013-07-26 | 2016-05-11 | 电子科技大学 | A kind of phrase sound method for distinguishing speek person based on rarefaction representation |
US9412373B2 (en) * | 2013-08-28 | 2016-08-09 | Texas Instruments Incorporated | Adaptive environmental context sample and update for comparing speech recognition |
CN103531206B (en) * | 2013-09-30 | 2017-09-29 | 华南理工大学 | A kind of local speech emotional characteristic extraction method with global information of combination |
CN103594084B (en) * | 2013-10-23 | 2016-05-25 | 江苏大学 | Combine speech-emotion recognition method and the system of punishment rarefaction representation dictionary learning |
CN103531208B (en) * | 2013-11-01 | 2016-08-03 | 东南大学 | A kind of space flight stress emotion identification method based on short term memory weight fusion |
CN103886869B (en) * | 2014-04-09 | 2016-09-21 | 北京京东尚科信息技术有限公司 | A kind of information feedback method based on speech emotion recognition and system |
CN104464756A (en) * | 2014-12-10 | 2015-03-25 | 黑龙江真美广播通讯器材有限公司 | Small speaker emotion recognition system |
CN105304078B (en) * | 2015-10-28 | 2019-04-30 | 中国电子科技集团公司第三研究所 | Target sound data training device and target sound data training method |
CN106073706B (en) * | 2016-06-01 | 2019-08-20 | 中国科学院软件研究所 | A kind of customized information and audio data analysis method and system towards Mini-mental Status Examination |
CN106356058B (en) * | 2016-09-08 | 2019-08-20 | 河海大学 | A kind of robust speech recognition methods based on multiband feature compensation |
CN106653000A (en) * | 2016-11-16 | 2017-05-10 | 太原理工大学 | Emotion intensity test method based on voice information |
CN106782615B (en) * | 2016-12-20 | 2020-06-12 | 科大讯飞股份有限公司 | Voice data emotion detection method, device and system |
CN108335704A (en) * | 2017-01-19 | 2018-07-27 | 晨星半导体股份有限公司 | Vagitus detection circuit and relevant detection method |
CN107831549A (en) * | 2017-11-20 | 2018-03-23 | 中国地质大学(武汉) | A kind of NMP cepstrum SST Time-frequency methods of ENPEMF signals |
CN108091323B (en) * | 2017-12-19 | 2020-10-13 | 想象科技(北京)有限公司 | Method and apparatus for emotion recognition from speech |
CN108831450A (en) * | 2018-03-30 | 2018-11-16 | 杭州鸟瞰智能科技股份有限公司 | A kind of virtual robot man-machine interaction method based on user emotion identification |
CN108766462B (en) * | 2018-06-21 | 2021-06-08 | 浙江中点人工智能科技有限公司 | Voice signal feature learning method based on Mel frequency spectrum first-order derivative |
CN108986843B (en) * | 2018-08-10 | 2020-12-11 | 杭州网易云音乐科技有限公司 | Audio data processing method and device, medium and computing equipment |
CN109394209B (en) * | 2018-10-15 | 2021-07-06 | 汕头大学 | Personalized emotion adjusting system and method for pregnant woman music treatment |
CN110008987B (en) * | 2019-02-20 | 2022-02-22 | 深圳大学 | Method and device for testing robustness of classifier, terminal and storage medium |
CN110211566A (en) * | 2019-06-08 | 2019-09-06 | 安徽中医药大学 | A kind of classification method of compressed sensing based hepatolenticular degeneration disfluency |
CN111179975B (en) * | 2020-04-14 | 2020-08-04 | 深圳壹账通智能科技有限公司 | Voice endpoint detection method for emotion recognition, electronic device and storage medium |
CN113255800B (en) * | 2021-06-02 | 2021-10-15 | 中国科学院自动化研究所 | Robust emotion modeling system based on audio and video |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011069055A2 (en) * | 2009-12-04 | 2011-06-09 | Stc.Unm | System and methods of compressed sensing as applied to computer graphics and computer imaging |
CN102419974A (en) * | 2010-09-24 | 2012-04-18 | 国际商业机器公司 | Sparse representation features for speech recognition |
CN102768732A (en) * | 2012-06-13 | 2012-11-07 | 北京工业大学 | Face recognition method integrating sparse preserving mapping and multi-class property Bagging |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011069055A2 (en) * | 2009-12-04 | 2011-06-09 | Stc.Unm | System and methods of compressed sensing as applied to computer graphics and computer imaging |
CN102419974A (en) * | 2010-09-24 | 2012-04-18 | 国际商业机器公司 | Sparse representation features for speech recognition |
CN102768732A (en) * | 2012-06-13 | 2012-11-07 | 北京工业大学 | Face recognition method integrating sparse preserving mapping and multi-class property Bagging |
Non-Patent Citations (4)
Title |
---|
Robust Facial Expression Recognition via Compressive Sensing; Shiqing Zhang et al.; Sensors; 2012-03-21; vol. 12, no. 3; 3747-3761 * |
Speech Emotion Recognition under Noise Background (噪声背景下的语音情感识别); Zhang Shiqing et al.; Journal of Southwest Jiaotong University; 2009-06-30; vol. 44, no. 3; 442-447 * |
Also Published As
Publication number | Publication date |
---|---|
CN103021406A (en) | 2013-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103021406B (en) | Robust speech emotion recognition method based on compressive sensing | |
Venkataramanan et al. | Emotion recognition from speech | |
CN103117059B (en) | Voice signal characteristics extracting method based on tensor decomposition | |
Shaw et al. | Emotion recognition and classification in speech using artificial neural networks | |
Li et al. | Robust speaker identification using an auditory-based feature | |
CN102655003B (en) | Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient) | |
Murugappan et al. | DWT and MFCC based human emotional speech classification using LDA | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
Paliwal et al. | Usefulness of phase in speech processing | |
Waghmare et al. | Emotion recognition system from artificial marathi speech using MFCC and LDA techniques | |
Chauhan et al. | Speech to text converter using Gaussian Mixture Model (GMM) | |
Nidhyananthan et al. | Language and text-independent speaker identification system using GMM | |
CN103258537A (en) | Method utilizing characteristic combination to identify speech emotions and device thereof | |
Palo et al. | Comparison of neural network models for speech emotion recognition | |
Nirjon et al. | sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices--a feasibility study | |
CN116682463A (en) | Multi-mode emotion recognition method and system | |
Hao et al. | A new feature in speech recognition based on wavelet transform | |
Ishac et al. | A text-dependent speaker-recognition system | |
Aggarwal et al. | Characterization between child and adult voice using machine learning algorithm | |
Yue et al. | Speaker age recognition based on isolated words by using SVM | |
Meyer et al. | Complementarity of MFCC, PLP and Gabor features in the presence of speech-intrinsic variabilities | |
Mengistu et al. | Text independent Amharic language dialect recognition: A hybrid approach of VQ and GMM | |
Jagtap et al. | A survey on speech emotion recognition using MFCC and different classifier | |
Sukhwal et al. | Comparative study between different classifiers based speaker recognition system using MFCC for noisy environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20141022 Termination date: 20161218 |
CF01 | Termination of patent right due to non-payment of annual fee |