CN108154879B - Non-specific human voice emotion recognition method based on cepstrum separation signal - Google Patents

Non-specific human voice emotion recognition method based on cepstrum separation signal

Info

Publication number
CN108154879B
CN108154879B CN201711434048.3A CN201711434048A
Authority
CN
China
Prior art keywords
signal
cepstrum
emotion
mel
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711434048.3A
Other languages
Chinese (zh)
Other versions
CN108154879A (en)
Inventor
胡维平
郝梓岚
王艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201711434048.3A priority Critical patent/CN108154879B/en
Publication of CN108154879A publication Critical patent/CN108154879A/en
Application granted granted Critical
Publication of CN108154879B publication Critical patent/CN108154879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/26 - Speech to text systems
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 - Voice signal separating
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention discloses a non-specific human voice emotion recognition method based on cepstrum separation signals, which specifically comprises the following steps: preprocessing an emotion voice library; extracting traditional features from the preprocessed emotion voice library; performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library; performing feature extraction on the reconstructed voice signals to obtain a reconstructed emotion voice library; dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition. The recognition method can effectively improve the speech emotion recognition rate for non-specific speakers.

Description

Non-specific human voice emotion recognition method based on cepstrum separation signal
Technical Field
The invention relates to the technical field of non-specific (speaker-independent) human voice recognition, and in particular to a non-specific human voice emotion recognition method based on cepstrum separation signals.
Background
The glottal and vocal tract signals contain rich emotion information. Owing to differences between individual vocal tracts, the vocal tract information generally carries many personal characteristics, which introduces considerable interference into speaker-independent (non-specific speaker) emotion recognition. In previous work on spectral feature extraction, features were extracted from the whole speech signal, and such features carry a large amount of speaker-specific information. They are often effective for emotion recognition of a specific speaker, but their recognition performance for non-specific speakers is clearly worse than for specific speakers.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a non-specific human voice emotion recognition method based on cepstrum separation signals.
The technical scheme for realizing the purpose of the invention is as follows:
a non-specific human voice emotion recognition method based on cepstrum separation signals specifically comprises the following steps:
s1, preprocessing an emotion voice library;
s2, extracting traditional characteristics from the preprocessed emotion voice library;
s3, performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library;
s4, extracting the characteristics of the reconstructed voice signal to obtain a reconstructed emotion voice library;
S5, dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition;
Emotion recognition of non-specific human voice is completed through the above steps.
In step S1, the emotion voice library contains 7 emotions; the speech is sampled at 16 kHz with 8-bit quantization and is then framed and windowed.
The 7 emotions comprise neutrality, anger, fear, happiness, sadness, aversion and boredom.
The frame length used for framing is within 10-30 ms.
The windowing adopts a Hamming window.
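A minimal pre-processing sketch in Python is given below. The 25 ms frame length and 50% overlap are illustrative assumptions within the stated 10-30 ms range; the patent does not fix exact values at this point.

import numpy as np

def frame_and_window(signal, sr=16000, frame_ms=25, hop_ms=12.5):
    """Split a speech signal into overlapping Hamming-windowed frames.
    25 ms / 12.5 ms are illustrative values inside the stated 10-30 ms range,
    not values fixed by the patent."""
    frame_len = int(sr * frame_ms / 1000)        # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)            # 200 samples at 16 kHz
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)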
In step S2, extracting traditional features means extracting traditional acoustic features from the framed speech of the emotion voice library; the extracted acoustic features comprise: prosodic feature parameters, sound quality features, nonlinear features and spectral features;
prosodic feature parameter extraction, comprising: mean value of pitch frequency, mean value of short-time energy and rate of change of zero crossing rate;
sound quality feature extraction, comprising: frequency perturbation entropy and amplitude perturbation entropy;
nonlinear feature extraction, comprising: a Hurst index;
spectral feature extraction, comprising: Mel frequency domain cepstral coefficients (MFCC), linear prediction coefficients (LPC), and nonlinear Mel frequency domain parameters (NFD_Mel);
The Mel frequency domain cepstral coefficients (MFCC) are obtained by extracting the 12-dimensional MFCC and its first-order difference (24 dimensions in total) and then taking the mean over frames.
The linear prediction coefficients (LPC) are obtained by extracting the 12-dimensional LPC and taking its mean;
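The conventional spectral features above (the 24-dimensional MFCC-plus-difference mean and the 12-dimensional LPC mean) could be computed roughly as in the following sketch; librosa is an implementation choice assumed here, not something specified by the patent.

import numpy as np
import librosa

def conventional_spectral_features(signal, sr=16000):
    """12-D MFCC plus first-order difference (24-D) averaged over frames,
    and a 12-D mean LPC vector. Frame sizes and librosa defaults are assumptions."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12)      # (12, n_frames)
    d_mfcc = librosa.feature.delta(mfcc, order=1)                # first-order difference
    mfcc24_mean = np.concatenate([mfcc, d_mfcc]).mean(axis=1)    # 24-D global mean

    # 12th-order LPC per frame, averaged; librosa.lpc returns order + 1
    # coefficients with a leading 1, which is dropped here.
    frames = librosa.util.frame(signal, frame_length=512, hop_length=256)
    lpc = np.stack([librosa.lpc(f.astype(float), order=12)[1:] for f in frames.T])
    return mfcc24_mean, lpc.mean(axis=0)                         # (24,), (12,)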
The nonlinear Mel frequency domain parameters (NFD_Mel) are calculated by the following steps:
S2-1, perform a short-time Fourier transform on each frame obtained after the S1 framing, then apply the Teager energy operator, and square the spectral magnitude to obtain the energy spectrum;
S2-2, pass the energy spectrum obtained in S2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S2-3, apply a discrete cosine transform to the logarithmic energies obtained in S2-2 to obtain the static 12-order NFD_Mel parameters;
S2-4, take the first-order difference of the NFD_Mel coefficients in S2-3 to obtain the dynamic 12-order NFD_Mel parameters;
S2-5, combine the parameter results of S2-3 and S2-4 to form the final 24-order NFD_Mel parameters.
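A sketch of the NFD_Mel computation follows, written generically so the same helper can be reused for the CSS features of step S4. Assumptions not fixed by the text: the Teager operator is applied to each time-domain frame before the FFT (one possible reading of S2-1), a 512-point FFT with 24 Mel filters is used, and the first-order difference is a simple frame-to-frame difference.

import numpy as np
from scipy.fftpack import dct
import librosa

def teager(x):
    """Teager energy operator: psi(x[n]) = x[n]^2 - x[n-1] * x[n+1]."""
    y = np.zeros_like(x, dtype=float)
    y[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return y

def mel_dct_features(frames, sr=16000, n_fft=512, n_mels=24, use_teager=True):
    """Static 12-order DCT parameters plus their first-order difference
    (24 orders per frame). With use_teager=True this sketches NFD_Mel."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    static = []
    for frame in frames:                                # e.g. from frame_and_window()
        x = teager(frame) if use_teager else frame
        energy = np.abs(np.fft.rfft(x, n=n_fft)) ** 2   # squared spectral magnitude
        log_e = np.log(mel_fb @ energy + 1e-10)         # log Mel filter-bank energies
        static.append(dct(log_e, type=2, norm='ortho')[:12])
    static = np.array(static)
    delta = np.vstack([np.zeros(12), np.diff(static, axis=0)])
    return np.hstack([static, delta])                   # (n_frames, 24)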
In step S3, the cepstrum domain separation and reconstruction are performed on the speech signal; the framing adopts a 256-point frame length with a frame shift of 128, specifically:
S3-1, take each frame signal x(n) obtained after the S1 framing and compute its complex cepstrum. Each frame of the speech signal x(n) is obtained by filtering the glottal pulse excitation e(n) with the vocal tract response v(n), i.e.
x(n) = e(n) * v(n)
Apply the Z transform to x(n) so that the convolution becomes a product, take the logarithm so that the product becomes a sum, and finally apply the inverse Z transform to the summed signal to obtain the complex cepstrum;
S3-2, take each frame signal x(n) after the S1 framing and compute its cepstrum: apply the Z transform to x(n), take the logarithm and keep the real part, and finally apply the inverse Z transform to obtain the cepstrum;
S3-3, the pitch of the human voice lies in the range 50 Hz-700 Hz; search for the maximum of the excitation-source impulse in the corresponding range of the cepstrum. If the impulse amplitude of this maximum exceeds 0.08, record the position of the peak point A and judge the frame as voiced; otherwise the frame is unvoiced and is skipped;
S3-4, because the cepstrum loses the phase information of the signal during its calculation, when the frame is judged voiced the separation is carried out on the complex cepstrum. Taking point A as the demarcation point, the complex cepstrum is divided into the vocal tract response and the glottal excitation. In order to retain all of the glottal information while gradually including vocal tract information, point A is moved towards the origin; the moving distance is denoted L, with L = b·A, and the end point after the move is denoted A1, where b is an adjustable parameter with 0 ≤ b ≤ 1;
S3-5, according to the symmetry of the complex cepstrum, the mirror segment is taken at the point symmetric to A1 about the origin, and the two symmetric segments are combined into a single cepstrum signal. The inverse complex cepstrum transform is applied to this combined signal to reconstruct the time-domain signal x1(n); the reconstructed speech signal x1(n) contains only partial vocal tract information together with the entire glottal information.
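A hedged sketch of this separation and reconstruction step is given below. The FFT-based complex cepstrum with np.unwrap phase unwrapping, the way the symmetric high-quefrency region is retained, and the default b = 0.34 (taken from the experiment reported later in the description) are implementation assumptions rather than details fixed by the text.

import numpy as np

def cepstrum_separate_reconstruct(frame, sr=16000, b=0.34, thresh=0.08):
    """Sketch of step S3: cepstrum-domain separation of glottal excitation
    from the vocal tract and reconstruction of x1(n)."""
    n = len(frame)                                      # 256-point frames in the patent
    X = np.fft.fft(frame)
    # real cepstrum: used only for the voiced decision and the peak search
    real_cep = np.fft.ifft(np.log(np.abs(X) + 1e-10)).real
    # complex cepstrum: used for the separation itself (keeps phase information)
    complex_cep = np.fft.ifft(np.log(np.abs(X) + 1e-10)
                              + 1j * np.unwrap(np.angle(X))).real

    # search the excitation peak A in the quefrency range of a 50-700 Hz pitch
    q_lo = int(sr / 700)
    q_hi = min(int(sr / 50), n // 2 - 1)
    A = q_lo + int(np.argmax(real_cep[q_lo:q_hi]))
    if real_cep[A] <= thresh:
        return None                                     # unvoiced frame: skip it

    # move the split point towards the origin: L = b*A, so A1 = A - b*A
    A1 = max(1, int(round((1 - b) * A)))
    # keep the symmetric part of the complex cepstrum away from the origin
    # (entire glottal information plus part of the vocal tract information)
    kept = np.zeros(n)
    kept[A1:n - A1 + 1] = complex_cep[A1:n - A1 + 1]

    # inverse complex cepstrum: FFT -> exp -> inverse FFT
    return np.fft.ifft(np.exp(np.fft.fft(kept))).real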
In step S4, feature extraction is performed on the reconstructed speech signal, using a 256-point frame length and a frame shift of 128, specifically as follows:
S4-1-1, take the reconstructed speech signal x1(n), perform the short-time Fourier transform to obtain the spectrum, and square the spectral magnitude to obtain the energy spectrum;
S4-1-2, pass the energy spectrum obtained in S4-1-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-1-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-1-2 to obtain the static 12-order CSS-MFCC parameters;
S4-1-4, take the first-order difference of the CSS-MFCC coefficients in S4-1-3 to obtain the dynamic 12-order CSS-MFCC parameters;
S4-1-5, combine the parameter results of S4-1-3 and S4-1-4 to form the final 24-order CSS-MFCC parameters, and take the mean of the 24-order CSS-MFCC as a global feature;
S4-2-1, take x1(n), perform the short-time Fourier transform, apply the Teager energy operator to the signal according to the formula below, and square the spectral magnitude to obtain the energy spectrum, where the Teager energy operator is:
ψ(x(n)) = x²(n) - x(n-1)x(n+1);
S4-2-2, pass the energy spectrum obtained in S4-2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-2-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-2-2 to obtain the static 12-order CSS-NFDMel parameters;
S4-2-4, take the first-order difference of the CSS-NFDMel coefficients in S4-2-3 to obtain the dynamic 12-order CSS-NFDMel parameters;
S4-2-5, combine the parameter results of S4-2-3 and S4-2-4 to form the final 24-order CSS-NFDMel parameters, and take the mean of the 24-order CSS-NFDMel as a global feature.
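Reusing cepstrum_separate_reconstruct() from the sketch after step S3 and mel_dct_features() from the sketch after step S2, the CSS global features could be assembled roughly as below. Treating the Teager energy operator as the only difference between CSS-MFCC and CSS-NFDMel is an assumption of this sketch.

import numpy as np

def css_global_features(frames, sr=16000, b=0.34):
    """24-order CSS-MFCC and CSS-NFDMel means computed on the frames x1(n)
    reconstructed from the cepstrum-separated signal."""
    recon = [cepstrum_separate_reconstruct(f, sr=sr, b=b) for f in frames]
    recon = [r for r in recon if r is not None]              # voiced frames only
    css_mfcc = mel_dct_features(recon, sr=sr, use_teager=False).mean(axis=0)
    css_nfdmel = mel_dct_features(recon, sr=sr, use_teager=True).mean(axis=0)
    return np.concatenate([css_mfcc, css_nfdmel])            # 48-D CSS feature block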
In step S5, the reconstructed emotion voice library obtained in step S4 is divided into a 65% training set and a 35% test set; an SVM classifier is trained on the training set, the test set is input into the trained classifier, and the decision result is output after voice recognition, specifically:
S5-1, extract the features of the emotion voice library: combine the features of the mean of the pitch frequency, the short-time energy mean, the zero-crossing rate change rate, the frequency perturbation entropy, the amplitude perturbation entropy, the Hurst index, the Mel frequency domain cepstral coefficients (MFCC), the linear prediction coefficients (LPC) and the nonlinear Mel frequency domain parameters (NFD_Mel);
S5-2, use 65% of the features of S5-1 as the training set to train the SVM classifier and the remaining 35% as the test set for evaluating the classifier; input the test set into the trained classifier and output the decision result after voice recognition.
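A minimal sketch of the 65%/35% split and SVM classification is shown below; the RBF kernel, the C value, the feature scaling and the random seed are choices not fixed by the patent.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_and_evaluate(features, labels, seed=0):
    """Train an SVM on 65% of the utterance-level feature vectors and report
    the decisions and recognition rate on the remaining 35%."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, train_size=0.65, stratify=labels, random_state=seed)
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10, gamma='scale'))
    clf.fit(X_train, y_train)
    return clf.predict(X_test), clf.score(X_test, y_test)   # decisions, accuracy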
Beneficial effects: the non-specific human speech emotion recognition method based on cepstrum separation signals combines the new features with the traditional features, retains the vocal cord (glottal) information, discards part of the vocal tract information, searches for an optimal separation point, and extracts features from the processed signal, which can effectively improve the speech emotion recognition rate for non-specific speakers.
Drawings
FIG. 1 is a cepstral separation signal flow diagram;
FIG. 2 is a diagram showing the recognition rate of CSS-MFCC and combined features.
Detailed Description
The invention is further illustrated but not limited by the following figures and examples.
Example:
a non-specific human voice emotion recognition method based on cepstrum separation signals specifically comprises the following steps:
s1, preprocessing an emotion voice library;
s2, extracting traditional characteristics from the preprocessed emotion voice library;
s3, performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library;
s4, extracting the characteristics of the reconstructed voice signal to obtain a reconstructed emotion voice library;
S5, dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition;
Emotion recognition of non-specific human voice is completed through the above steps.
In step S1, the emotion voice library contains 7 emotions; the speech is sampled at 16 kHz with 8-bit quantization and is then framed and windowed.
The 7 emotions comprise neutrality, anger, fear, happiness, sadness, aversion and boredom.
The frame length used for framing is within 10-30 ms.
The windowing adopts a Hamming window.
In step S2, extracting traditional features means extracting traditional acoustic features from the framed speech of the emotion voice library; the extracted acoustic features comprise: prosodic feature parameters, sound quality features, nonlinear features and spectral features;
prosodic feature parameter extraction, comprising: mean value of pitch frequency, mean value of short-time energy and rate of change of zero crossing rate;
sound quality feature extraction, comprising: frequency perturbation entropy and amplitude perturbation entropy;
nonlinear feature extraction, comprising: a Hurst index;
spectral feature extraction, comprising: Mel frequency domain cepstral coefficients (MFCC), linear prediction coefficients (LPC), and nonlinear Mel frequency domain parameters (NFD_Mel);
The Mel frequency domain cepstral coefficients (MFCC) are obtained by extracting the 12-dimensional MFCC and its first-order difference (24 dimensions in total) and then taking the mean over frames.
The linear prediction coefficients (LPC) are obtained by extracting the 12-dimensional LPC and taking its mean;
The nonlinear Mel frequency domain parameters (NFD_Mel) are calculated by the following steps:
S2-1, perform a short-time Fourier transform on each frame of signal after framing, then apply the Teager energy operator, and square the spectral magnitude to obtain the energy spectrum;
S2-2, pass the energy spectrum obtained in S2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S2-3, apply a discrete cosine transform to the logarithmic energies obtained in S2-2 to obtain the static 12-order NFD_Mel parameters;
S2-4, take the first-order difference of the NFD_Mel coefficients in S2-3 to obtain the dynamic 12-order NFD_Mel parameters;
S2-5, combine the parameter results of S2-3 and S2-4 to form the final 24-order NFD_Mel parameters.
In step S3, cepstrum domain separation and reconstruction are performed on the speech signal, using a 256-point frame length with a frame shift of 128, as shown in FIG. 1, specifically:
S3-1, take each frame signal x(n) obtained after the S1 framing and compute its complex cepstrum. Each frame of the speech signal x(n) is obtained by filtering the glottal pulse excitation e(n) with the vocal tract response v(n), i.e.
x(n) = e(n) * v(n)
Apply the Z transform to x(n) so that the convolution becomes a product, take the logarithm so that the product becomes a sum, and finally apply the inverse Z transform to the summed signal to obtain the complex cepstrum;
S3-2, take each frame signal x(n) after the S1 framing and compute its cepstrum: apply the Z transform to x(n), take the logarithm and keep the real part, and finally apply the inverse Z transform to obtain the cepstrum;
S3-3, the pitch of the human voice lies in the range 50 Hz-700 Hz; search for the maximum of the excitation-source impulse in the corresponding range of the cepstrum. If the impulse amplitude of this maximum exceeds 0.08, record the position of the peak point A and judge the frame as voiced; otherwise the frame is unvoiced and is skipped;
S3-4, because the cepstrum loses the phase information of the signal during its calculation, when the frame is judged voiced the separation is carried out on the complex cepstrum. Taking point A as the demarcation point, the complex cepstrum is divided into the vocal tract response and the glottal excitation. In order to retain all of the glottal information while gradually including vocal tract information, point A is moved towards the origin; the moving distance is denoted L, with L = b·A, and the end point after the move is denoted A1, where b is an adjustable parameter with 0 ≤ b ≤ 1;
S3-5, according to the symmetry of the complex cepstrum, the mirror segment is taken at the point symmetric to A1 about the origin, and the two symmetric segments are combined into a single cepstrum signal. The inverse complex cepstrum transform is applied to this combined signal to reconstruct the time-domain signal x1(n); the reconstructed speech signal x1(n) contains only partial vocal tract information together with the entire glottal information.
In step S4, feature extraction is performed on the reconstructed speech signal, using a 256-point frame length and a frame shift of 128, specifically as follows:
S4-1-1, take the reconstructed speech signal x1(n), perform the short-time Fourier transform to obtain the spectrum, and square the spectral magnitude to obtain the energy spectrum;
S4-1-2, pass the energy spectrum obtained in S4-1-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-1-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-1-2 to obtain the static 12-order CSS-MFCC parameters;
S4-1-4, take the first-order difference of the CSS-MFCC coefficients in S4-1-3 to obtain the dynamic 12-order CSS-MFCC parameters;
S4-1-5, combine the parameter results of S4-1-3 and S4-1-4 to form the final 24-order CSS-MFCC parameters, and take the mean of the 24-order CSS-MFCC as a global feature;
S4-2-1, take x1(n), perform the short-time Fourier transform, apply the Teager energy operator to the signal according to the formula below, and square the spectral magnitude to obtain the energy spectrum, where the Teager energy operator is:
ψ(x(n)) = x²(n) - x(n-1)x(n+1);
S4-2-2, pass the energy spectrum obtained in S4-2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-2-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-2-2 to obtain the static 12-order CSS-NFDMel parameters;
S4-2-4, take the first-order difference of the CSS-NFDMel coefficients in S4-2-3 to obtain the dynamic 12-order CSS-NFDMel parameters;
S4-2-5, combine the parameter results of S4-2-3 and S4-2-4 to form the final 24-order CSS-NFDMel parameters, and take the mean of the 24-order CSS-NFDMel as a global feature.
In step S5, the reconstructed emotion voice library obtained in step S4 is divided into a 65% training set and a 35% test set; an SVM classifier is trained on the training set, the test set is input into the trained classifier, and the decision result is output after voice recognition, specifically:
S5-1, extract the features of the emotion voice library: combine the features of the mean of the pitch frequency, the short-time energy mean, the zero-crossing rate change rate, the frequency perturbation entropy, the amplitude perturbation entropy, the Hurst index, the Mel frequency domain cepstral coefficients (MFCC), the linear prediction coefficients (LPC), the nonlinear Mel frequency domain parameters (NFD_Mel) and the nonlinear Mel frequency domain CSS-NFDMel of the cepstrum-separated signal;
S5-2, use 65% of the features of S5-1 as the training set to train the SVM classifier and the remaining 35% as the test set for evaluating the classifier; input the test set into the trained classifier and output the decision result after voice recognition.
The frame shift length and the value of the parameter b are selected and a feature combination experiment is performed; the combined features comprise: the fundamental frequency mean, the zero-crossing rate change rate, the short-time energy mean, the Hurst parameter, the frequency perturbation entropy, the amplitude perturbation entropy, the MFCC mean, the NFD_Mel mean and the LPC mean. All classifiers in the method adopt an SVM.
Experiment to determine the parameter b in step S3: the fundamental frequency mean, the zero-crossing rate change rate, the short-time energy mean, the Hurst parameter, the frequency perturbation entropy, the amplitude perturbation entropy, the MFCC mean, the NFD_Mel mean, the LPC mean and the CSS-MFCC under different values of the parameter b are combined, and the variation of the recognition rate with the parameter b is obtained, as shown in FIG. 2.
According to this recognition rate experiment, the recognition rate remains relatively stable at a high level when the value of the parameter b is between 0.15 and 0.45; when b = 0.34, the recognition rate reaches its maximum of 84.01%.
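A sketch of how such a sweep over b could be organized is shown below; extract_combined_features is a hypothetical helper that concatenates the conventional S2 features with the CSS features recomputed for a given b, and train_and_evaluate is the SVM sketch after step S5.

import numpy as np

def sweep_parameter_b(utterance_frames, labels, b_values=np.arange(0.05, 1.0, 0.05)):
    """Recompute the combined features for each candidate b and record the
    recognition rate; the description reports a peak of 84.01% near b = 0.34."""
    results = {}
    for b in b_values:
        X = np.stack([extract_combined_features(frames, b=b)   # hypothetical helper
                      for frames in utterance_frames])
        _, acc = train_and_evaluate(X, labels)
        results[round(float(b), 2)] = acc
    return results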
Finally, in order to verify the validity of the features of the method, combined experiments with multiple features are designed:
Experiment one: fundamental frequency, zero-crossing rate, short-time energy;
Experiment two: fundamental frequency, zero-crossing rate, short-time energy, CSS-NFDMel;
Experiment three: fundamental frequency, zero-crossing rate, short-time energy, CSS-MFCC;
Experiment four: fundamental frequency mean, zero-crossing rate change rate, short-time energy mean, Hurst parameter, frequency perturbation entropy, amplitude perturbation entropy, MFCC mean, NFD_Mel mean, LPC mean;
Experiment five: fundamental frequency mean, zero-crossing rate change rate, short-time energy mean, Hurst parameter, frequency perturbation entropy, amplitude perturbation entropy, MFCC mean, NFD_Mel mean, LPC mean, CSS-NFDMel;
Experiment six: fundamental frequency mean, zero-crossing rate change rate, short-time energy mean, Hurst parameter, frequency perturbation entropy, amplitude perturbation entropy, MFCC mean, NFD_Mel mean, LPC mean, CSS-MFCC.
TABLE 2 Multi-feature combination recognition rates
Experiment        Happy    Neutral  Angry    Sad      Fear     Boredom  Aversion  Average
Experiment one    33.33%   46.15%   35.71%   38.09%   56.52%   51.85%   86.66%    49.76%
Experiment two    54.16%   30.76%   28.57%   76.19%   78.26%   48.14%   80.00%    56.58%
Experiment three  58.33%   76.92%   61.90%   85.71%   69.56%   29.62%   86.66%    66.96%
Experiment four   58.72%   85.01%   88.03%   80.33%   78.61%   79.72%   86.57%    79.57%
Experiment five   62.50%   88.46%   85.71%   80.95%   78.26%   81.48%   93.33%    81.52%
Experiment six    66.66%   88.46%   90.47%   85.71%   78.26%   85.18%   93.33%    84.01%
As can be seen from Table 2, comparing experiment one with experiments two and three verifies that CSS-MFCC and CSS-NFDMel are effective features, and experiments four, five and six verify that the method can be combined with multiple features to improve the recognition rate; experiment six gives the highest recognition rate, 84.01%.

Claims (8)

1. A non-specific human voice emotion recognition method based on cepstrum separation signals is characterized by comprising the following steps:
s1, preprocessing an emotion voice library;
s2, extracting traditional characteristics from the preprocessed emotion voice library;
s3, performing cepstrum domain separation and reconstruction on the processed voice signals of the emotion voice library;
s4, extracting the characteristics of the reconstructed voice signal to obtain a reconstructed emotion voice library;
S5, dividing the reconstructed emotion voice library obtained in step S4 into a training set and a test set, training an SVM classifier on the training set, inputting the test set into the trained classifier, and outputting the decision result after voice recognition;
emotion recognition of non-specific human voice is completed through the above steps;
in step S2, extracting traditional features means extracting traditional acoustic features from the framed speech of the emotion voice library; the extracted acoustic features comprise: prosodic feature parameters, sound quality features, nonlinear features and spectral features;
prosodic feature parameter extraction, comprising: mean value of pitch frequency, mean value of short-time energy and rate of change of zero crossing rate;
sound quality feature extraction, comprising: frequency perturbation entropy and amplitude perturbation entropy;
nonlinear feature extraction, comprising: a Hurst index;
spectral feature extraction, comprising: Mel frequency domain cepstral coefficients MFCC, linear prediction coefficients LPC and nonlinear Mel frequency domain parameters NFD_Mel;
the Mel frequency domain cepstral coefficients MFCC are obtained by extracting the 12-dimensional MFCC and its first-order difference, 24 dimensions in total, and then taking the mean over the 24 dimensions;
the linear prediction coefficients LPC are obtained by extracting the 12-dimensional LPC and taking its mean;
the nonlinear Mel frequency domain parameters NFD_Mel are calculated by the following specific steps:
S2-1, perform a short-time Fourier transform on each frame of signal after framing, then apply the Teager energy operator, and square the spectral magnitude to obtain the energy spectrum;
S2-2, pass the energy spectrum obtained in S2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S2-3, apply a discrete cosine transform to the logarithmic energies obtained in S2-2 to obtain the static 12-order NFD_Mel parameters;
S2-4, take the first-order difference of the NFD_Mel coefficients in S2-3 to obtain the dynamic 12-order NFD_Mel parameters;
S2-5, combine the parameter results of S2-3 and S2-4 to form the final 24-order NFD_Mel parameters.
2. The method according to claim 1, wherein in step S1, the emotion voice library contains 7 emotions, and the speech is framed and windowed using a 16 kHz sampling rate and 8-bit quantization.
3. The method as claimed in claim 2, wherein the 7 emotions include neutral, angry, fear, happy, sad, hate and boring.
4. The method for non-specific human speech emotion recognition based on cepstrum separation signal, according to claim 2, wherein the framing is performed within 10-30 ms.
5. The method for non-specific human speech emotion recognition based on cepstrum separation signals as claimed in claim 2, wherein said windowing adopts a Hamming window.
6. The method for non-specific human speech emotion recognition based on cepstrum separation signals as claimed in claim 1, wherein in step S3, the cepstrum domain separation and reconstruction are performed on the speech signal, the framing adopts a 256-point frame length with a frame shift of 128, specifically:
S3-1, take each frame signal x(n) obtained after the S1 framing and compute its complex cepstrum, wherein each frame of the speech signal x(n) is obtained by filtering the glottal pulse excitation e(n) with the vocal tract response v(n), i.e.
x(n) = e(n) * v(n)
apply the Z transform to x(n) so that the convolution becomes a product, take the logarithm so that the product becomes a sum, and finally apply the inverse Z transform to the summed signal to obtain the complex cepstrum;
S3-2, take each frame signal x(n) after the S1 framing and compute its cepstrum: apply the Z transform to x(n), take the logarithm and keep the real part, and finally apply the inverse Z transform to obtain the cepstrum;
S3-3, the pitch of the human voice lies in the range 50 Hz-700 Hz; search for the maximum of the excitation-source impulse in the corresponding range of the cepstrum; if the impulse amplitude of this maximum exceeds 0.08, record the position of the peak point A and judge the frame as voiced, otherwise the frame is unvoiced and is skipped;
S3-4, because the cepstrum loses the phase information of the signal during its calculation, when the frame is judged voiced the separation is carried out on the complex cepstrum; taking point A as the demarcation point, the complex cepstrum is divided into the vocal tract response and the glottal excitation; in order to retain all of the glottal information while gradually including vocal tract information, point A is moved towards the origin, the moving distance is denoted L, with L = b·A, and the end point after the move is denoted A1, where b is an adjustable parameter with 0 ≤ b ≤ 1;
S3-5, according to the symmetry of the complex cepstrum, the mirror segment is taken at the point symmetric to A1 about the origin, and the two symmetric segments are combined into a single cepstrum signal; the inverse complex cepstrum transform is applied to this combined signal to reconstruct the time-domain signal x1(n), and the reconstructed speech signal x1(n) contains only partial vocal tract information together with the entire glottal information.
7. The method for non-specific human speech emotion recognition based on cepstrum separation signals according to claim 1, wherein in step S4, feature extraction is performed on the reconstructed speech signal, using a 256-point frame length and a frame shift of 128, specifically comprising the following steps:
S4-1-1, take the reconstructed speech signal x1(n), perform the short-time Fourier transform to obtain the spectrum, and square the spectral magnitude to obtain the energy spectrum;
S4-1-2, pass the energy spectrum obtained in S4-1-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-1-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-1-2 to obtain the static 12-order CSS-MFCC parameters;
S4-1-4, take the first-order difference of the CSS-MFCC coefficients in S4-1-3 to obtain the dynamic 12-order CSS-MFCC parameters;
S4-1-5, combine the parameter results of S4-1-3 and S4-1-4 to form the final 24-order CSS-MFCC parameters, and take the mean of the 24-order CSS-MFCC as a global feature;
S4-2-1, take x1(n), perform the short-time Fourier transform, apply the Teager energy operator to the signal according to the formula below, and square the spectral magnitude to obtain the energy spectrum, where the Teager energy operator is:
ψ(x(n)) = x²(n) - x(n-1)x(n+1);
S4-2-2, pass the energy spectrum obtained in S4-2-1 through a Mel frequency filter bank and take the logarithmic energy output by each filter;
S4-2-3, apply a discrete cosine transform to the logarithmic energies obtained in S4-2-2 to obtain the static 12-order CSS-NFDMel parameters;
S4-2-4, take the first-order difference of the CSS-NFDMel coefficients in S4-2-3 to obtain the dynamic 12-order CSS-NFDMel parameters;
S4-2-5, combine the parameter results of S4-2-3 and S4-2-4 to form the final 24-order CSS-NFDMel parameters, and take the mean of the 24-order CSS-NFDMel as a global feature.
8. The method for non-specific human speech emotion recognition based on cepstrum separation signals as claimed in claim 1, wherein in step S5, the reconstructed emotion voice library obtained after step S4 is divided into a 65% training set and a 35% test set, an SVM classifier is trained on the training set, the test set is input into the trained classifier, and the decision result is output after speech recognition, specifically:
S5-1, extract the features of the emotion voice library: combine the features of the mean of the pitch frequency, the short-time energy mean, the zero-crossing rate change rate, the frequency perturbation entropy, the amplitude perturbation entropy, the Hurst index, the Mel frequency domain cepstral coefficients MFCC, the linear prediction coefficients LPC and the nonlinear Mel frequency domain parameters NFD_Mel;
S5-2, use 65% of the features of S5-1 as the training set to train the SVM classifier and the remaining 35% as the test set for evaluating the classifier; input the test set into the trained classifier and output the decision result after speech recognition.
CN201711434048.3A 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal Active CN108154879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711434048.3A CN108154879B (en) 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711434048.3A CN108154879B (en) 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal

Publications (2)

Publication Number Publication Date
CN108154879A CN108154879A (en) 2018-06-12
CN108154879B true CN108154879B (en) 2021-04-09

Family

ID=62461990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711434048.3A Active CN108154879B (en) 2017-12-26 2017-12-26 Non-specific human voice emotion recognition method based on cepstrum separation signal

Country Status (1)

Country Link
CN (1) CN108154879B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036467B (en) * 2018-10-26 2021-04-16 南京邮电大学 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN110910897B (en) * 2019-12-05 2023-06-09 四川超影科技有限公司 Feature extraction method for motor abnormal sound recognition
CN111524535B (en) * 2020-04-30 2022-06-21 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN112599149A (en) * 2020-12-10 2021-04-02 中国传媒大学 Detection method and device for replay attack voice
CN113257279A (en) * 2021-03-24 2021-08-13 厦门大学 GTCN-based real-time voice emotion recognition method and application device
CN112712824B (en) * 2021-03-26 2021-06-29 之江实验室 Crowd information fused speech emotion recognition method and system
CN113555038B (en) * 2021-07-05 2023-12-29 东南大学 Speaker-independent voice emotion recognition method and system based on unsupervised domain countermeasure learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004000138A (en) * 2002-01-11 2004-01-08 Samsung Electronics Co Ltd Method and apparatus for grasping condition of animal by using acquisition and analysis of biomedical signal
CN102411932A (en) * 2011-09-30 2012-04-11 北京航空航天大学 Methods for extracting and modeling Chinese speech emotion in combination with glottis excitation and sound channel modulation information
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN106653000A (en) * 2016-11-16 2017-05-10 太原理工大学 Emotion intensity test method based on voice information
CN106992000A (en) * 2017-04-07 2017-07-28 安徽建筑大学 A kind of old man's speech-emotion recognition method of the multiple features fusion based on prediction
CN106991627A (en) * 2017-03-28 2017-07-28 广西师范大学 The distributed intelligence tutoring system acted on behalf of based on domain body and more

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004000138A (en) * 2002-01-11 2004-01-08 Samsung Electronics Co Ltd Method and apparatus for grasping condition of animal by using acquisition and analysis of biomedical signal
CN102411932A (en) * 2011-09-30 2012-04-11 北京航空航天大学 Methods for extracting and modeling Chinese speech emotion in combination with glottis excitation and sound channel modulation information
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN106653000A (en) * 2016-11-16 2017-05-10 太原理工大学 Emotion intensity test method based on voice information
CN106991627A (en) * 2017-03-28 2017-07-28 广西师范大学 The distributed intelligence tutoring system acted on behalf of based on domain body and more
CN106992000A (en) * 2017-04-07 2017-07-28 安徽建筑大学 A kind of old man's speech-emotion recognition method of the multiple features fusion based on prediction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Speech Emotion Recognition Using Non-Linear; Onur Erdem Korkmaz, Ayten Atasoy; IEEE: 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO); 2012-10-18; 2045-2049 *
Research on Feature Extraction and Recognition Algorithms in Speech Emotion Recognition; Sun Yaxin; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2016-01-15 (No. 01); full text *
Research on Speech Emotion Recognition; Xie Ling; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN108154879A (en) 2018-06-12

Similar Documents

Publication Publication Date Title
CN108154879B (en) Non-specific human voice emotion recognition method based on cepstrum separation signal
Muda et al. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques
Ramamohan et al. Sinusoidal model-based analysis and classification of stressed speech
Singh et al. An approach to extract feature using MFCC
US10410623B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
Demircan et al. Feature extraction from speech data for emotion recognition
US8930185B2 (en) Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
CN104021789A (en) Self-adaption endpoint detection method using short-time time-frequency value
Sathe-Pathak et al. Extraction of Pitch and Formants and its Analysis to identify 3 different emotional states of a person
Georgogiannis et al. Speech emotion recognition using non-linear teager energy based features in noisy environments
CN102655003B (en) Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Wanli et al. The research of feature extraction based on MFCC for speaker recognition
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
Waghmare et al. Emotion recognition system from artificial marathi speech using MFCC and LDA techniques
Jhawar et al. Speech disorder recognition using MFCC
CN108682432B (en) Speech emotion recognition device
Linh et al. MFCC-DTW algorithm for speech recognition in an intelligent wheelchair
Gangamohan et al. A Flexible Analysis Synthesis Tool (FAST) for studying the characteristic features of emotion in speech
Sethu et al. Empirical mode decomposition based weighted frequency feature for speech-based emotion classification
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
CN112151066A (en) Voice feature recognition-based language conflict monitoring method, medium and equipment
Lee et al. Speech emotion recognition using spectral entropy
Jawarkar et al. Speaker identification using whispered speech
Khulage Extraction of pitch, duration and formant frequencies for emotion recognition system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant