CN108154879B - Non-specific human voice emotion recognition method based on cepstrum separation signal - Google Patents

Non-specific human voice emotion recognition method based on cepstrum separation signal

Info

Publication number
CN108154879B
CN108154879B CN201711434048.3A CN201711434048A
Authority
CN
China
Prior art keywords
signal
cepstrum
emotion
mel
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711434048.3A
Other languages
Chinese (zh)
Other versions
CN108154879A (en)
Inventor
胡维平
郝梓岚
王艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201711434048.3A priority Critical patent/CN108154879B/en
Publication of CN108154879A publication Critical patent/CN108154879A/en
Application granted granted Critical
Publication of CN108154879B publication Critical patent/CN108154879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/26 - Speech to text systems
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 - Voice signal separating
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention discloses a non-specific human voice emotion recognition method based on cepstrum separation signals, which specifically comprises the following steps: preprocessing an emotion voice library; extracting traditional features from the preprocessed emotion voice library; performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library; performing feature extraction on the reconstructed voice signals to obtain a reconstructed emotion voice library; dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition. The recognition method can effectively improve the speech emotion recognition rate for non-specific speakers.

Description

Non-specific human voice emotion recognition method based on cepstrum separation signal
Technical Field
The invention relates to the technical field of non-specific (speaker-independent) human voice recognition, and in particular to a non-specific human voice emotion recognition method based on cepstrum separation signals.
Background
The glottal and vocal tract signals contain rich emotion information. Owing to differences between individual vocal tracts, the vocal tract information generally carries many personal characteristics, which introduces considerable interference into speaker-independent (non-specific speaker) emotion recognition. In previous work on spectral feature extraction, features were extracted from the whole speech signal, and such features carry a large amount of speaker-specific information. They are often effective for emotion recognition of a specific speaker, but their recognition performance for non-specific speakers is clearly worse than for specific speakers.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a non-specific human voice emotion recognition method based on cepstrum separation signals.
The technical scheme for realizing the purpose of the invention is as follows:
a non-specific human voice emotion recognition method based on cepstrum separation signals specifically comprises the following steps:
s1, preprocessing an emotion voice library;
s2, extracting traditional characteristics from the preprocessed emotion voice library;
s3, performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library;
s4, extracting the characteristics of the reconstructed voice signal to obtain a reconstructed emotion voice library;
S5, dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition;
Emotion recognition of non-specific human voice is completed through the above steps.
In step S1, the emotion voice library contains 7 emotions; the speech is sampled at 16 kHz with 8-bit quantization and is then framed and windowed.
The 7 emotions comprise neutrality, anger, fear, happiness, sadness, aversion and boredom.
The frame length used for framing is within 10-30 ms.
The windowing adopts a Hamming window.
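A minimal pre-processing sketch in Python is given below. The 25 ms frame length and 50% overlap are illustrative assumptions within the stated 10-30 ms range; the patent does not fix exact values at this point.

import numpy as np

def frame_and_window(signal, sr=16000, frame_ms=25, hop_ms=12.5):
    """Split a speech signal into overlapping Hamming-windowed frames.
    25 ms / 12.5 ms are illustrative values inside the stated 10-30 ms range,
    not values fixed by the patent."""
    frame_len = int(sr * frame_ms / 1000)        # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)            # 200 samples at 16 kHz
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)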
In step S2, extracting traditional features means extracting traditional acoustic features from the framed speech of the emotion voice library; the extracted acoustic features comprise: prosodic feature parameters, sound quality features, nonlinear features and spectral features;
prosodic feature parameter extraction, comprising: mean value of pitch frequency, mean value of short-time energy and rate of change of zero crossing rate;
sound quality feature extraction, comprising: frequency perturbation entropy and amplitude perturbation entropy;
nonlinear feature extraction, comprising: a Hurst index;
spectral feature extraction, comprising: Mel frequency domain cepstral coefficients (MFCC), linear prediction coefficients (LPC), and nonlinear Mel frequency domain parameters (NFD_Mel);
The Mel frequency domain cepstral coefficients (MFCC) are obtained by extracting the 12-dimensional MFCC and its first-order difference (24 dimensions in total) and then taking the mean over frames.
The linear prediction coefficients (LPC) are obtained by extracting the 12-dimensional LPC and taking its mean;
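The conventional spectral features above (the 24-dimensional MFCC-plus-difference mean and the 12-dimensional LPC mean) could be computed roughly as in the following sketch; librosa is an implementation choice assumed here, not something specified by the patent.

import numpy as np
import librosa

def conventional_spectral_features(signal, sr=16000):
    """12-D MFCC plus first-order difference (24-D) averaged over frames,
    and a 12-D mean LPC vector. Frame sizes and librosa defaults are assumptions."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12)      # (12, n_frames)
    d_mfcc = librosa.feature.delta(mfcc, order=1)                # first-order difference
    mfcc24_mean = np.concatenate([mfcc, d_mfcc]).mean(axis=1)    # 24-D global mean

    # 12th-order LPC per frame, averaged; librosa.lpc returns order + 1
    # coefficients with a leading 1, which is dropped here.
    frames = librosa.util.frame(signal, frame_length=512, hop_length=256)
    lpc = np.stack([librosa.lpc(f.astype(float), order=12)[1:] for f in frames.T])
    return mfcc24_mean, lpc.mean(axis=0)                         # (24,), (12,)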
The nonlinear Mel frequency domain parameters (NFD_Mel) are calculated by the following steps:
S2-1, perform a short-time Fourier transform on each frame obtained after the S1 framing, then apply the Teager energy operator, and square the spectral magnitude to obtain the energy spectrum;
S2-2, pass the energy spectrum obtained in S2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S2-3, apply a discrete cosine transform to the logarithmic energies obtained in S2-2 to obtain the static 12-order NFD_Mel parameters;
S2-4, take the first-order difference of the NFD_Mel coefficients in S2-3 to obtain the dynamic 12-order NFD_Mel parameters;
S2-5, combine the parameter results of S2-3 and S2-4 to form the final 24-order NFD_Mel parameters.
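A sketch of the NFD_Mel computation follows, written generically so the same helper can be reused for the CSS features of step S4. Assumptions not fixed by the text: the Teager operator is applied to each time-domain frame before the FFT (one possible reading of S2-1), a 512-point FFT with 24 Mel filters is used, and the first-order difference is a simple frame-to-frame difference.

import numpy as np
from scipy.fftpack import dct
import librosa

def teager(x):
    """Teager energy operator: psi(x[n]) = x[n]^2 - x[n-1] * x[n+1]."""
    y = np.zeros_like(x, dtype=float)
    y[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return y

def mel_dct_features(frames, sr=16000, n_fft=512, n_mels=24, use_teager=True):
    """Static 12-order DCT parameters plus their first-order difference
    (24 orders per frame). With use_teager=True this sketches NFD_Mel."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    static = []
    for frame in frames:                                # e.g. from frame_and_window()
        x = teager(frame) if use_teager else frame
        energy = np.abs(np.fft.rfft(x, n=n_fft)) ** 2   # squared spectral magnitude
        log_e = np.log(mel_fb @ energy + 1e-10)         # log Mel filter-bank energies
        static.append(dct(log_e, type=2, norm='ortho')[:12])
    static = np.array(static)
    delta = np.vstack([np.zeros(12), np.diff(static, axis=0)])
    return np.hstack([static, delta])                   # (n_frames, 24)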
In step S3, the cepstrum domain separation and reconstruction are performed on the speech signal; the framing adopts a 256-point frame length with a frame shift of 128, specifically:
S3-1, take each frame signal x(n) obtained after the S1 framing and compute its complex cepstrum. Each frame of the speech signal x(n) is obtained by filtering the glottal pulse excitation e(n) with the vocal tract response v(n), i.e.
x(n) = e(n) * v(n)
Apply the Z transform to x(n) so that the convolution becomes a product, take the logarithm so that the product becomes a sum, and finally apply the inverse Z transform to the summed signal to obtain the complex cepstrum;
S3-2, take each frame signal x(n) after the S1 framing and compute its cepstrum: apply the Z transform to x(n), take the logarithm and keep the real part, and finally apply the inverse Z transform to obtain the cepstrum;
S3-3, the pitch of the human voice lies in the range 50 Hz-700 Hz; search for the maximum of the excitation-source impulse in the corresponding range of the cepstrum. If the impulse amplitude of this maximum exceeds 0.08, record the position of the peak point A and judge the frame as voiced; otherwise the frame is unvoiced and is skipped;
S3-4, because the cepstrum loses the phase information of the signal during its calculation, when the frame is judged voiced the separation is carried out on the complex cepstrum. Taking point A as the demarcation point, the complex cepstrum is divided into the vocal tract response and the glottal excitation. In order to retain all of the glottal information while gradually including vocal tract information, point A is moved towards the origin; the moving distance is denoted L, with L = b·A, and the end point after the move is denoted A1, where b is an adjustable parameter with 0 ≤ b ≤ 1;
S3-5, according to the symmetry of the complex cepstrum, the mirror segment is taken at the point symmetric to A1 about the origin, and the two symmetric segments are combined into a single cepstrum signal. The inverse complex cepstrum transform is applied to this combined signal to reconstruct the time-domain signal x1(n); the reconstructed speech signal x1(n) contains only partial vocal tract information together with the entire glottal information.
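A hedged sketch of this separation and reconstruction step is given below. The FFT-based complex cepstrum with np.unwrap phase unwrapping, the way the symmetric high-quefrency region is retained, and the default b = 0.34 (taken from the experiment reported later in the description) are implementation assumptions rather than details fixed by the text.

import numpy as np

def cepstrum_separate_reconstruct(frame, sr=16000, b=0.34, thresh=0.08):
    """Sketch of step S3: cepstrum-domain separation of glottal excitation
    from the vocal tract and reconstruction of x1(n)."""
    n = len(frame)                                      # 256-point frames in the patent
    X = np.fft.fft(frame)
    # real cepstrum: used only for the voiced decision and the peak search
    real_cep = np.fft.ifft(np.log(np.abs(X) + 1e-10)).real
    # complex cepstrum: used for the separation itself (keeps phase information)
    complex_cep = np.fft.ifft(np.log(np.abs(X) + 1e-10)
                              + 1j * np.unwrap(np.angle(X))).real

    # search the excitation peak A in the quefrency range of a 50-700 Hz pitch
    q_lo = int(sr / 700)
    q_hi = min(int(sr / 50), n // 2 - 1)
    A = q_lo + int(np.argmax(real_cep[q_lo:q_hi]))
    if real_cep[A] <= thresh:
        return None                                     # unvoiced frame: skip it

    # move the split point towards the origin: L = b*A, so A1 = A - b*A
    A1 = max(1, int(round((1 - b) * A)))
    # keep the symmetric part of the complex cepstrum away from the origin
    # (entire glottal information plus part of the vocal tract information)
    kept = np.zeros(n)
    kept[A1:n - A1 + 1] = complex_cep[A1:n - A1 + 1]

    # inverse complex cepstrum: FFT -> exp -> inverse FFT
    return np.fft.ifft(np.exp(np.fft.fft(kept))).real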
In step S4, feature extraction is performed on the reconstructed speech signal, using a 256-point frame length and a frame shift of 128, specifically as follows:
S4-1-1, take the reconstructed speech signal x1(n), perform the short-time Fourier transform to obtain the spectrum, and square the spectral magnitude to obtain the energy spectrum;
S4-1-2, pass the energy spectrum obtained in S4-1-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-1-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-1-2 to obtain the static 12-order CSS-MFCC parameters;
S4-1-4, take the first-order difference of the CSS-MFCC coefficients in S4-1-3 to obtain the dynamic 12-order CSS-MFCC parameters;
S4-1-5, combine the parameter results of S4-1-3 and S4-1-4 to form the final 24-order CSS-MFCC parameters, and take the mean of the 24-order CSS-MFCC as a global feature;
S4-2-1, take x1(n), perform the short-time Fourier transform, apply the Teager energy operator to the signal according to the formula below, and square the spectral magnitude to obtain the energy spectrum, where the Teager energy operator is:
ψ(x(n)) = x²(n) - x(n-1)x(n+1);
S4-2-2, pass the energy spectrum obtained in S4-2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-2-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-2-2 to obtain the static 12-order CSS-NFDMel parameters;
S4-2-4, take the first-order difference of the CSS-NFDMel coefficients in S4-2-3 to obtain the dynamic 12-order CSS-NFDMel parameters;
S4-2-5, combine the parameter results of S4-2-3 and S4-2-4 to form the final 24-order CSS-NFDMel parameters, and take the mean of the 24-order CSS-NFDMel as a global feature.
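Reusing cepstrum_separate_reconstruct() from the sketch after step S3 and mel_dct_features() from the sketch after step S2, the CSS global features could be assembled roughly as below. Treating the Teager energy operator as the only difference between CSS-MFCC and CSS-NFDMel is an assumption of this sketch.

import numpy as np

def css_global_features(frames, sr=16000, b=0.34):
    """24-order CSS-MFCC and CSS-NFDMel means computed on the frames x1(n)
    reconstructed from the cepstrum-separated signal."""
    recon = [cepstrum_separate_reconstruct(f, sr=sr, b=b) for f in frames]
    recon = [r for r in recon if r is not None]              # voiced frames only
    css_mfcc = mel_dct_features(recon, sr=sr, use_teager=False).mean(axis=0)
    css_nfdmel = mel_dct_features(recon, sr=sr, use_teager=True).mean(axis=0)
    return np.concatenate([css_mfcc, css_nfdmel])            # 48-D CSS feature block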
In step S5, the reconstructed emotion voice library obtained in step S4 is divided into a 65% training set and a 35% test set; an SVM classifier is trained on the training set, the test set is input into the trained classifier, and the decision result is output after voice recognition, specifically:
S5-1, extract the features of the emotion voice library: combine the features of the mean of the pitch frequency, the short-time energy mean, the zero-crossing rate change rate, the frequency perturbation entropy, the amplitude perturbation entropy, the Hurst index, the Mel frequency domain cepstral coefficients (MFCC), the linear prediction coefficients (LPC) and the nonlinear Mel frequency domain parameters (NFD_Mel);
S5-2, use 65% of the features of S5-1 as the training set to train the SVM classifier and the remaining 35% as the test set for evaluating the classifier; input the test set into the trained classifier and output the decision result after voice recognition.
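A minimal sketch of the 65%/35% split and SVM classification is shown below; the RBF kernel, the C value, the feature scaling and the random seed are choices not fixed by the patent.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_and_evaluate(features, labels, seed=0):
    """Train an SVM on 65% of the utterance-level feature vectors and report
    the decisions and recognition rate on the remaining 35%."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, train_size=0.65, stratify=labels, random_state=seed)
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10, gamma='scale'))
    clf.fit(X_train, y_train)
    return clf.predict(X_test), clf.score(X_test, y_test)   # decisions, accuracy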
Beneficial effects: the non-specific human speech emotion recognition method based on cepstrum separation signals combines the new features with the traditional features, retains the vocal cord (glottal) information, discards part of the vocal tract information, searches for an optimal separation point, and extracts features from the processed signal, which can effectively improve the speech emotion recognition rate for non-specific speakers.
Drawings
FIG. 1 is a cepstral separation signal flow diagram;
FIG. 2 is a diagram showing the recognition rate of CSS-MFCC and combined features.
Detailed Description
The invention is further illustrated but not limited by the following figures and examples.
Example:
a non-specific human voice emotion recognition method based on cepstrum separation signals specifically comprises the following steps:
s1, preprocessing an emotion voice library;
s2, extracting traditional characteristics from the preprocessed emotion voice library;
s3, performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library;
s4, extracting the characteristics of the reconstructed voice signal to obtain a reconstructed emotion voice library;
S5, dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition;
Emotion recognition of non-specific human voice is completed through the above steps.
In step S1, the emotion voice library contains 7 emotions; the speech is sampled at 16 kHz with 8-bit quantization and is then framed and windowed.
The 7 emotions comprise neutrality, anger, fear, happiness, sadness, aversion and boredom.
The frame length used for framing is within 10-30 ms.
The windowing adopts a Hamming window.
In step S2, extracting traditional features means extracting traditional acoustic features from the framed speech of the emotion voice library; the extracted acoustic features comprise: prosodic feature parameters, sound quality features, nonlinear features and spectral features;
prosodic feature parameter extraction, comprising: mean value of pitch frequency, mean value of short-time energy and rate of change of zero crossing rate;
sound quality feature extraction, comprising: frequency perturbation entropy and amplitude perturbation entropy;
nonlinear feature extraction, comprising: a Hurst index;
spectral feature extraction, comprising: Mel frequency domain cepstral coefficients (MFCC), linear prediction coefficients (LPC), and nonlinear Mel frequency domain parameters (NFD_Mel);
The Mel frequency domain cepstral coefficients (MFCC) are obtained by extracting the 12-dimensional MFCC and its first-order difference (24 dimensions in total) and then taking the mean over frames.
The linear prediction coefficients (LPC) are obtained by extracting the 12-dimensional LPC and taking its mean;
The nonlinear Mel frequency domain parameters (NFD_Mel) are calculated by the following steps:
S2-1, perform a short-time Fourier transform on each frame of signal after framing, then apply the Teager energy operator, and square the spectral magnitude to obtain the energy spectrum;
S2-2, pass the energy spectrum obtained in S2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S2-3, apply a discrete cosine transform to the logarithmic energies obtained in S2-2 to obtain the static 12-order NFD_Mel parameters;
S2-4, take the first-order difference of the NFD_Mel coefficients in S2-3 to obtain the dynamic 12-order NFD_Mel parameters;
S2-5, combine the parameter results of S2-3 and S2-4 to form the final 24-order NFD_Mel parameters.
In step S3, cepstrum domain separation and reconstruction are performed on the speech signal, using a 256-point frame length with a frame shift of 128, as shown in FIG. 1, specifically:
S3-1, take each frame signal x(n) obtained after the S1 framing and compute its complex cepstrum. Each frame of the speech signal x(n) is obtained by filtering the glottal pulse excitation e(n) with the vocal tract response v(n), i.e.
x(n) = e(n) * v(n)
Apply the Z transform to x(n) so that the convolution becomes a product, take the logarithm so that the product becomes a sum, and finally apply the inverse Z transform to the summed signal to obtain the complex cepstrum;
S3-2, take each frame signal x(n) after the S1 framing and compute its cepstrum: apply the Z transform to x(n), take the logarithm and keep the real part, and finally apply the inverse Z transform to obtain the cepstrum;
S3-3, the pitch of the human voice lies in the range 50 Hz-700 Hz; search for the maximum of the excitation-source impulse in the corresponding range of the cepstrum. If the impulse amplitude of this maximum exceeds 0.08, record the position of the peak point A and judge the frame as voiced; otherwise the frame is unvoiced and is skipped;
S3-4, because the cepstrum loses the phase information of the signal during its calculation, when the frame is judged voiced the separation is carried out on the complex cepstrum. Taking point A as the demarcation point, the complex cepstrum is divided into the vocal tract response and the glottal excitation. In order to retain all of the glottal information while gradually including vocal tract information, point A is moved towards the origin; the moving distance is denoted L, with L = b·A, and the end point after the move is denoted A1, where b is an adjustable parameter with 0 ≤ b ≤ 1;
S3-5, according to the symmetry of the complex cepstrum, the mirror segment is taken at the point symmetric to A1 about the origin, and the two symmetric segments are combined into a single cepstrum signal. The inverse complex cepstrum transform is applied to this combined signal to reconstruct the time-domain signal x1(n); the reconstructed speech signal x1(n) contains only partial vocal tract information together with the entire glottal information.
In step S4, feature extraction is performed on the reconstructed speech signal, using a 256-point frame length and a frame shift of 128, specifically as follows:
S4-1-1, take the reconstructed speech signal x1(n), perform the short-time Fourier transform to obtain the spectrum, and square the spectral magnitude to obtain the energy spectrum;
S4-1-2, pass the energy spectrum obtained in S4-1-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-1-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-1-2 to obtain the static 12-order CSS-MFCC parameters;
S4-1-4, take the first-order difference of the CSS-MFCC coefficients in S4-1-3 to obtain the dynamic 12-order CSS-MFCC parameters;
S4-1-5, combine the parameter results of S4-1-3 and S4-1-4 to form the final 24-order CSS-MFCC parameters, and take the mean of the 24-order CSS-MFCC as a global feature;
S4-2-1, take x1(n), perform the short-time Fourier transform, apply the Teager energy operator to the signal according to the formula below, and square the spectral magnitude to obtain the energy spectrum, where the Teager energy operator is:
ψ(x(n)) = x²(n) - x(n-1)x(n+1);
S4-2-2, pass the energy spectrum obtained in S4-2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-2-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-2-2 to obtain the static 12-order CSS-NFDMel parameters;
S4-2-4, take the first-order difference of the CSS-NFDMel coefficients in S4-2-3 to obtain the dynamic 12-order CSS-NFDMel parameters;
S4-2-5, combine the parameter results of S4-2-3 and S4-2-4 to form the final 24-order CSS-NFDMel parameters, and take the mean of the 24-order CSS-NFDMel as a global feature.
In step S5, the reconstructed emotion voice library obtained in step S4 is divided into a 65% training set and a 35% test set; an SVM classifier is trained on the training set, the test set is input into the trained classifier, and the decision result is output after voice recognition, specifically:
S5-1, extract the features of the emotion voice library: combine the features of the mean of the pitch frequency, the short-time energy mean, the zero-crossing rate change rate, the frequency perturbation entropy, the amplitude perturbation entropy, the Hurst index, the Mel frequency domain cepstral coefficients (MFCC), the linear prediction coefficients (LPC), the nonlinear Mel frequency domain parameters (NFD_Mel) and the nonlinear Mel frequency domain CSS-NFDMel of the cepstrum-separated signal;
S5-2, use 65% of the features of S5-1 as the training set to train the SVM classifier and the remaining 35% as the test set for evaluating the classifier; input the test set into the trained classifier and output the decision result after voice recognition.
The frame shift length and the value of the parameter b are selected and a feature combination experiment is performed; the combined features comprise: the fundamental frequency mean, the zero-crossing rate change rate, the short-time energy mean, the Hurst parameter, the frequency perturbation entropy, the amplitude perturbation entropy, the MFCC mean, the NFD_Mel mean and the LPC mean. All classifiers in the method adopt an SVM.
Experiment to determine the parameter b in step S3: the fundamental frequency mean, the zero-crossing rate change rate, the short-time energy mean, the Hurst parameter, the frequency perturbation entropy, the amplitude perturbation entropy, the MFCC mean, the NFD_Mel mean, the LPC mean and the CSS-MFCC under different values of the parameter b are combined, and the variation of the recognition rate with the parameter b is obtained, as shown in FIG. 2.
According to this recognition rate experiment, the recognition rate remains relatively stable at a high level when the value of the parameter b is between 0.15 and 0.45; when b = 0.34, the recognition rate reaches its maximum of 84.01%.
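A sketch of how such a sweep over b could be organized is shown below; extract_combined_features is a hypothetical helper that concatenates the conventional S2 features with the CSS features recomputed for a given b, and train_and_evaluate is the SVM sketch after step S5.

import numpy as np

def sweep_parameter_b(utterance_frames, labels, b_values=np.arange(0.05, 1.0, 0.05)):
    """Recompute the combined features for each candidate b and record the
    recognition rate; the description reports a peak of 84.01% near b = 0.34."""
    results = {}
    for b in b_values:
        X = np.stack([extract_combined_features(frames, b=b)   # hypothetical helper
                      for frames in utterance_frames])
        _, acc = train_and_evaluate(X, labels)
        results[round(float(b), 2)] = acc
    return results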
Finally, in order to verify the validity of the features of the method, combined experiments with multiple features are designed:
Experiment one: fundamental frequency, zero-crossing rate, short-time energy;
Experiment two: fundamental frequency, zero-crossing rate, short-time energy, CSS-NFDMel;
Experiment three: fundamental frequency, zero-crossing rate, short-time energy, CSS-MFCC;
Experiment four: fundamental frequency mean, zero-crossing rate change rate, short-time energy mean, Hurst parameter, frequency perturbation entropy, amplitude perturbation entropy, MFCC mean, NFD_Mel mean, LPC mean;
Experiment five: fundamental frequency mean, zero-crossing rate change rate, short-time energy mean, Hurst parameter, frequency perturbation entropy, amplitude perturbation entropy, MFCC mean, NFD_Mel mean, LPC mean, CSS-NFDMel;
Experiment six: fundamental frequency mean, zero-crossing rate change rate, short-time energy mean, Hurst parameter, frequency perturbation entropy, amplitude perturbation entropy, MFCC mean, NFD_Mel mean, LPC mean, CSS-MFCC.
TABLE 2 Multi-feature combination recognition rates
Experiment        Happy    Neutral  Angry    Sad      Fear     Boredom  Aversion  Average
Experiment one    33.33%   46.15%   35.71%   38.09%   56.52%   51.85%   86.66%    49.76%
Experiment two    54.16%   30.76%   28.57%   76.19%   78.26%   48.14%   80.00%    56.58%
Experiment three  58.33%   76.92%   61.90%   85.71%   69.56%   29.62%   86.66%    66.96%
Experiment four   58.72%   85.01%   88.03%   80.33%   78.61%   79.72%   86.57%    79.57%
Experiment five   62.50%   88.46%   85.71%   80.95%   78.26%   81.48%   93.33%    81.52%
Experiment six    66.66%   88.46%   90.47%   85.71%   78.26%   85.18%   93.33%    84.01%
As can be seen from Table 2, comparing experiment one with experiments two and three verifies that CSS-MFCC and CSS-NFDMel are effective features, and experiments four, five and six verify that the method can be combined with multiple features to improve the recognition rate; experiment six gives the highest recognition rate, 84.01%.

Claims (8)

1. A non-specific human voice emotion recognition method based on cepstrum separation signals is characterized by comprising the following steps:
s1, preprocessing an emotion voice library;
s2, extracting traditional characteristics from the preprocessed emotion voice library;
s3, performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library;
s4, extracting the characteristics of the reconstructed voice signal to obtain a reconstructed emotion voice library;
S5, dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition;
emotion recognition of non-specific human voice is completed through the above steps;
in step S2, extracting traditional features means extracting traditional acoustic features from the framed speech of the emotion voice library; the extracted acoustic features comprise: prosodic feature parameters, sound quality features, nonlinear features and spectral features;
prosodic feature parameter extraction, comprising: mean value of pitch frequency, mean value of short-time energy and rate of change of zero crossing rate;
sound quality feature extraction, comprising: frequency perturbation entropy and amplitude perturbation entropy;
nonlinear feature extraction, comprising: a Hurst index;
spectral feature extraction, comprising: Mel frequency domain cepstral coefficients MFCC, linear prediction coefficients LPC and nonlinear Mel frequency domain parameters NFD_Mel;
the Mel frequency domain cepstral coefficients MFCC are obtained by extracting the 12-dimensional MFCC and its first-order difference, 24 dimensions in total, and then taking the mean over the 24 dimensions;
the linear prediction coefficients LPC are obtained by extracting the 12-dimensional LPC and taking its mean;
the nonlinear Mel frequency domain parameters NFD_Mel are calculated by the following specific steps:
S2-1, perform a short-time Fourier transform on each frame of signal after framing, then apply the Teager energy operator, and square the spectral magnitude to obtain the energy spectrum;
S2-2, pass the energy spectrum obtained in S2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S2-3, apply a discrete cosine transform to the logarithmic energies obtained in S2-2 to obtain the static 12-order NFD_Mel parameters;
S2-4, take the first-order difference of the NFD_Mel coefficients in S2-3 to obtain the dynamic 12-order NFD_Mel parameters;
S2-5, combine the parameter results of S2-3 and S2-4 to form the final 24-order NFD_Mel parameters.
2. The method according to claim 1, wherein in step S1, the emotion voice library contains 7 emotions, and the speech is framed and windowed using a 16 kHz sampling rate and 8-bit quantization.
3. The method as claimed in claim 2, wherein the 7 emotions include neutral, angry, fear, happy, sad, hate and boring.
4. The method for non-specific human speech emotion recognition based on cepstrum separation signal, according to claim 2, wherein the framing is performed within 10-30 ms.
5. The method for non-specific human speech emotion recognition based on cepstrum separation signals as claimed in claim 2, wherein said windowing adopts a Hamming window.
6. The method for non-specific human speech emotion recognition based on cepstrum separation signals as claimed in claim 1, wherein in step S3, the cepstrum domain separation and reconstruction are performed on the speech signal, the framing adopts a 256-point frame length with a frame shift of 128, specifically:
S3-1, take each frame signal x(n) obtained after the S1 framing and compute its complex cepstrum, wherein each frame of the speech signal x(n) is obtained by filtering the glottal pulse excitation e(n) with the vocal tract response v(n), i.e.
x(n) = e(n) * v(n)
apply the Z transform to x(n) so that the convolution becomes a product, take the logarithm so that the product becomes a sum, and finally apply the inverse Z transform to the summed signal to obtain the complex cepstrum;
S3-2, take each frame signal x(n) after the S1 framing and compute its cepstrum: apply the Z transform to x(n), take the logarithm and keep the real part, and finally apply the inverse Z transform to obtain the cepstrum;
S3-3, the pitch of the human voice lies in the range 50 Hz-700 Hz; search for the maximum of the excitation-source impulse in the corresponding range of the cepstrum; if the impulse amplitude of this maximum exceeds 0.08, record the position of the peak point A and judge the frame as voiced, otherwise the frame is unvoiced and is skipped;
S3-4, because the cepstrum loses the phase information of the signal during its calculation, when the frame is judged voiced the separation is carried out on the complex cepstrum; taking point A as the demarcation point, the complex cepstrum is divided into the vocal tract response and the glottal excitation; in order to retain all of the glottal information while gradually including vocal tract information, point A is moved towards the origin, the moving distance is denoted L, with L = b·A, and the end point after the move is denoted A1, where b is an adjustable parameter with 0 ≤ b ≤ 1;
S3-5, according to the symmetry of the complex cepstrum, the mirror segment is taken at the point symmetric to A1 about the origin, and the two symmetric segments are combined into a single cepstrum signal; the inverse complex cepstrum transform is applied to this combined signal to reconstruct the time-domain signal x1(n), and the reconstructed speech signal x1(n) contains only partial vocal tract information together with the entire glottal information.
7. The method for non-specific human speech emotion recognition based on cepstrum separation signals according to claim 1, wherein in step S4, feature extraction is performed on the reconstructed speech signal, using a 256-point frame length and a frame shift of 128, specifically comprising the following steps:
S4-1-1, take the reconstructed speech signal x1(n), perform the short-time Fourier transform to obtain the spectrum, and square the spectral magnitude to obtain the energy spectrum;
S4-1-2, pass the energy spectrum obtained in S4-1-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-1-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-1-2 to obtain the static 12-order CSS-MFCC parameters;
S4-1-4, take the first-order difference of the CSS-MFCC coefficients in S4-1-3 to obtain the dynamic 12-order CSS-MFCC parameters;
S4-1-5, combine the parameter results of S4-1-3 and S4-1-4 to form the final 24-order CSS-MFCC parameters, and take the mean of the 24-order CSS-MFCC as a global feature;
S4-2-1, take x1(n), perform the short-time Fourier transform, apply the Teager energy operator to the signal according to the formula below, and square the spectral magnitude to obtain the energy spectrum, where the Teager energy operator is:
ψ(x(n)) = x²(n) - x(n-1)x(n+1);
S4-2-2, pass the energy spectrum obtained in S4-2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-2-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-2-2 to obtain the static 12-order CSS-NFDMel parameters;
S4-2-4, take the first-order difference of the CSS-NFDMel coefficients in S4-2-3 to obtain the dynamic 12-order CSS-NFDMel parameters;
S4-2-5, combine the parameter results of S4-2-3 and S4-2-4 to form the final 24-order CSS-NFDMel parameters, and take the mean of the 24-order CSS-NFDMel as a global feature.
8. The method for non-specific human speech emotion recognition based on cepstrum separation signals as claimed in claim 1, wherein in step S5, the reconstructed emotion voice library obtained after step S4 is divided into a 65% training set and a 35% test set, an SVM classifier is trained on the training set, the test set is input into the trained classifier, and the decision result is output after speech recognition, specifically:
S5-1, extract the features of the emotion voice library: combine the features of the mean of the pitch frequency, the short-time energy mean, the zero-crossing rate change rate, the frequency perturbation entropy, the amplitude perturbation entropy, the Hurst index, the Mel frequency domain cepstral coefficients MFCC, the linear prediction coefficients LPC and the nonlinear Mel frequency domain parameters NFD_Mel;
S5-2, use 65% of the features of S5-1 as the training set to train the SVM classifier and the remaining 35% as the test set for evaluating the classifier; input the test set into the trained classifier and output the decision result after speech recognition.
CN201711434048.3A 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal Active CN108154879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711434048.3A CN108154879B (en) 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711434048.3A CN108154879B (en) 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal

Publications (2)

Publication Number Publication Date
CN108154879A CN108154879A (en) 2018-06-12
CN108154879B true CN108154879B (en) 2021-04-09

Family

ID=62461990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711434048.3A Active CN108154879B (en) 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal

Country Status (1)

Country Link
CN (1) CN108154879B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036467B (en) * 2018-10-26 2021-04-16 南京邮电大学 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN110910897B (en) * 2019-12-05 2023-06-09 四川超影科技有限公司 Feature extraction method for motor abnormal sound recognition
CN111524535B (en) * 2020-04-30 2022-06-21 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN112599149A (en) * 2020-12-10 2021-04-02 中国传媒大学 Detection method and device for replay attack voice
CN113257279A (en) * 2021-03-24 2021-08-13 厦门大学 GTCN-based real-time voice emotion recognition method and application device
CN112712824B (en) * 2021-03-26 2021-06-29 之江实验室 Crowd information fused speech emotion recognition method and system
CN113555038B (en) * 2021-07-05 2023-12-29 东南大学 Speaker-independent voice emotion recognition method and system based on unsupervised domain countermeasure learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004000138A (en) * 2002-01-11 2004-01-08 Samsung Electronics Co Ltd Method and apparatus for grasping condition of animal by using acquisition and analysis of biomedical signal
CN102411932A (en) * 2011-09-30 2012-04-11 北京航空航天大学 Methods for extracting and modeling Chinese speech emotion in combination with glottis excitation and sound channel modulation information
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN106653000A (en) * 2016-11-16 2017-05-10 太原理工大学 Emotion intensity test method based on voice information
CN106992000A (en) * 2017-04-07 2017-07-28 安徽建筑大学 A kind of old man's speech-emotion recognition method of the multiple features fusion based on prediction
CN106991627A (en) * 2017-03-28 2017-07-28 广西师范大学 The distributed intelligence tutoring system acted on behalf of based on domain body and more

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004000138A (en) * 2002-01-11 2004-01-08 Samsung Electronics Co Ltd Method and apparatus for grasping condition of animal by using acquisition and analysis of biomedical signal
CN102411932A (en) * 2011-09-30 2012-04-11 北京航空航天大学 Methods for extracting and modeling Chinese speech emotion in combination with glottis excitation and sound channel modulation information
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN106653000A (en) * 2016-11-16 2017-05-10 太原理工大学 Emotion intensity test method based on voice information
CN106991627A (en) * 2017-03-28 2017-07-28 广西师范大学 The distributed intelligence tutoring system acted on behalf of based on domain body and more
CN106992000A (en) * 2017-04-07 2017-07-28 安徽建筑大学 A kind of old man's speech-emotion recognition method of the multiple features fusion based on prediction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Speech Emotion Recognition Using Non-Linear; Onur Erdem Korkmaz, Ayten Atasoy; IEEE: 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO); 2012-10-18; 2045-2049 *
Research on Feature Extraction and Recognition Algorithms in Speech Emotion Recognition; Sun Yaxin; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2016-01-15 (No. 01); full text *
Research on Speech Emotion Recognition; Xie Ling; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN108154879A (en) 2018-06-12

Similar Documents

Publication Publication Date Title
CN108154879B (en) Non-specific human voice emotion recognition method based on cepstrum separation signal
Muda et al. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques
Ramamohan et al. Sinusoidal model-based analysis and classification of stressed speech
Singh et al. An approach to extract feature using MFCC
US10410623B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
Demircan et al. Feature extraction from speech data for emotion recognition
US8930185B2 (en) Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
CN104021789A (en) Self-adaption endpoint detection method using short-time time-frequency value
Sathe-Pathak et al. Extraction of Pitch and Formants and its Analysis to identify 3 different emotional states of a person
Georgogiannis et al. Speech emotion recognition using non-linear teager energy based features in noisy environments
CN102655003B (en) Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Wanli et al. The research of feature extraction based on MFCC for speaker recognition
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
Waghmare et al. Emotion recognition system from artificial marathi speech using MFCC and LDA techniques
Jhawar et al. Speech disorder recognition using MFCC
CN108682432B (en) Speech emotion recognition device
Linh et al. MFCC-DTW algorithm for speech recognition in an intelligent wheelchair
Gangamohan et al. A Flexible Analysis Synthesis Tool (FAST) for studying the characteristic features of emotion in speech
Sethu et al. Empirical mode decomposition based weighted frequency feature for speech-based emotion classification
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
CN112151066A (en) Voice feature recognition-based language conflict monitoring method, medium and equipment
Lee et al. Speech emotion recognition using spectral entropy
Jawarkar et al. Speaker identification using whispered speech
Khulage Extraction of pitch, duration and formant frequencies for emotion recognition system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant