CN108305639B - Speech emotion recognition method, computer-readable storage medium and terminal - Google Patents

Speech emotion recognition method, computer-readable storage medium and terminal

Info

Publication number
CN108305639B
CN108305639B (application CN201810455163.7A)
Authority
CN
China
Prior art keywords
mfcc
short
frequency
order
parameters
Prior art date
Legal status
Active
Application number
CN201810455163.7A
Other languages
Chinese (zh)
Other versions
CN108305639A (en)
Inventor
邓立新
王思羽
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810455163.7A
Publication of CN108305639A
Application granted
Publication of CN108305639B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

A speech emotion recognition method and apparatus, a computer-readable storage medium and a terminal are provided, and the method comprises the following steps: acquiring a speech signal to be processed; preprocessing the acquired speech signal to obtain a preprocessed speech signal; extracting characteristic parameters of the preprocessed speech signal, the characteristic parameters comprising short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC; forming the extracted characteristic parameters into a corresponding feature vector sequence to obtain the feature vector sequence corresponding to the speech signal; and training and recognizing the feature vector sequence corresponding to the speech signal with a support vector machine to obtain a corresponding speech emotion recognition result. The scheme can improve the accuracy of speech emotion recognition.

Description

Speech emotion recognition method, computer-readable storage medium and terminal
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice emotion recognition method, a computer readable storage medium and a terminal.
Background
With the rapid development of information technology and the continuous enhancement of human dependence on computers, the capability of human-computer interaction is more and more emphasized by researchers. In fact, the problems to be solved in human-computer interaction are consistent with important factors in human-to-human communication, and the most important factor is the capacity of 'emotional intelligence'.
At present, research on emotion information processing is being intensively conducted, and research on emotion information carried by speech signals is receiving more and more attention. Speech emotion recognition refers to a technology that processes and recognizes speech signals with signal processing techniques and pattern recognition methods in order to determine which emotion the speech expresses.
However, the conventional speech emotion recognition method has a problem of low recognition accuracy.
Disclosure of Invention
The invention solves the technical problem of how to improve the accuracy of speech emotion recognition.
In order to solve the above technical problem, an embodiment of the present invention provides a speech emotion recognition method, where the method includes:
acquiring a voice signal to be processed;
preprocessing the acquired voice signal to obtain a preprocessed voice signal;
extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC;
forming a corresponding feature vector sequence by using the extracted feature parameters to obtain a feature vector sequence corresponding to the voice signal;
and training and identifying the characteristic vector sequence corresponding to the voice signal by adopting a support vector machine to obtain a corresponding voice emotion identification result.
Optionally, the preprocessing the acquired voice signal includes:
the acquired speech signal is sampled and quantized, pre-emphasized, frame-windowed, short-time energy analyzed and end-point detected.
Optionally, when performing endpoint detection on the speech signal, determining a speech start frame in the following manner:
traversing a plurality of frames obtained after preprocessing to obtain traversed current frames;
calculating the short-time energy of the traversed current frame and the continuous frames with the preset number behind the traversed current frame;
when the short-time energy of the traversed current frame and the continuous preset number of frames behind the traversed current frame is determined to be more than or equal to the short-time energy of the initial silent section voice signal, calculating the ratio of the short-time energy between the traversed current frame and the next frame;
and when the calculated ratio is determined to be greater than or equal to a preset threshold value, determining the traversed current frame as the voice starting frame of the voice signal.
Optionally, the short-time energy of the preprocessed speech signal and its derived parameters include short-time energy, a maximum value of the short-time energy, a minimum value of the short-time energy, a mean value of the short-time energy, a variance of the short-time energy, short-time energy jitter, a linear regression coefficient of the short-time energy, a mean square error of the linear regression coefficient of the short-time energy, and a proportion of the short-time energy below 250Hz to the total short-time energy of the plurality of frames obtained after the preprocessing.
Optionally, the pitch frequency and its derived parameters of the preprocessed speech signal include the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0; where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
Optionally, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, over the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
Optionally, the 20 Mel cepstrum coefficients of the improved MFCC comprise the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC.
Optionally, the 1-6 order MFCC, the 3-10 order Mid-MFCC, and the 7-12 order I-MFCC are calculated by the following formulas:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
The embodiment of the invention also provides a computer-readable storage medium, on which computer instructions are stored, and when the computer instructions are executed, the steps of the speech emotion recognition method are executed.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of being operated on the processor, and the processor executes the steps of the speech emotion recognition method when executing the computer instructions.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
according to the scheme, the characteristic parameters including the short-time energy and the derived parameters thereof, the fundamental tone frequency and the derived parameters thereof, the tone characteristic formant and the derived parameters thereof and the 20-order Mel cepstrum coefficient obtained by MFCC are extracted from the preprocessed voice signals, the extracted characteristic parameters of the 20-order Mel cepstrum coefficient obtained by MFCC cover the full frequency domain, and compared with the MFCC parameters only covering the low frequency, the recognition precision of the middle and high frequency domains can be improved, so that the accuracy of voice emotion recognition can be improved, and the use experience of a user can be improved.
Further, when performing endpoint detection on the speech signal, the short-time energy of the traversed current frame and of a preset number of consecutive frames after it is first calculated. When these short-time energies are all determined to be greater than or equal to the short-time energy of the initial silence segment, the ratio of the short-time energy between the traversed current frame and the next frame is calculated, and when the calculated ratio is determined to be greater than or equal to a preset threshold, the traversed current frame is determined to be the speech start frame of the speech signal. Because it is first checked whether the short-time energies of the preset number of consecutive frames including the current frame are all greater than or equal to that of the initial silence segment, the influence of glitch interference on endpoint detection can be reduced or even avoided, so the accuracy of endpoint detection can be improved.
Drawings
FIG. 1 is a flow chart of a speech emotion recognition method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech emotion recognition apparatus in an embodiment of the present invention.
Detailed Description
According to the technical scheme of the embodiments of the invention, characteristic parameters including the short-time energy and its derived parameters, the pitch frequency and its derived parameters, the voice-quality formant features and their derived parameters, and the 20 Mel cepstrum coefficients of the improved MFCC are extracted from the preprocessed speech signal. The extracted 20 Mel cepstrum coefficients cover the full frequency domain; compared with MFCC parameters that cover only the low frequencies, the recognition precision in the middle and high frequency ranges can be improved, so the accuracy of speech emotion recognition can be improved and the user experience enhanced.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
FIG. 1 is a flowchart illustrating a speech emotion recognition method according to an embodiment of the present invention. Referring to fig. 1, a speech emotion recognition method, the method comprising:
step S101: and acquiring a voice signal to be processed.
In a specific implementation, the voice signal to be processed is a digital signal obtained by performing analog-to-digital conversion on a corresponding analog signal.
Step S102: and preprocessing the acquired voice signal to obtain a preprocessed voice signal.
In a specific implementation, when preprocessing the acquired speech signal, the acquired speech signal is first sampled and quantized, pre-emphasized, frame-wise windowed, short-time energy analyzed, and end-point detected.
In one embodiment of the present invention, a first-order digital filter H(z) = 1 - μ·z^(-1) with μ = 0.98 is used for pre-emphasis of the speech signal; the frame length adopted during framing is 320 samples and the frame shift is 80 samples; windowing is performed with a Hamming window.
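As an illustration of the preprocessing described above (pre-emphasis with μ = 0.98, frame length 320 samples, frame shift 80 samples, Hamming window), a minimal Python/NumPy sketch is given below; the function name and the assumption that the signal has already been sampled and quantized are illustrative, not part of the patent text.

```python
import numpy as np

def preprocess(signal, mu=0.98, frame_len=320, frame_shift=80):
    """Pre-emphasize the signal, then split it into overlapping Hamming-windowed frames."""
    # Pre-emphasis: y[n] = x[n] - mu * x[n-1], i.e. H(z) = 1 - mu * z^(-1)
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Framing (assumes the signal is at least one frame long), then windowing
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift:i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```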
In an embodiment of the present invention, in order to improve the accuracy of endpoint detection, a two-stage detection method is used to detect the speech endpoints. Specifically, the method comprises the following steps:
firstly, the voice signals of a plurality of frames obtained after preprocessing can be traversed according to the sequence, the short-time energy of the voice signals of the traversed current frame and the subsequent continuous frames with the preset number is calculated, and the short-time energy of the traversed current frame and the subsequent continuous frames with the preset number is respectively compared with the short-time energy of the voice signal of the initial silent section, so as to determine whether the short-time energy of the traversed current frame and the subsequent continuous frames with the preset number is greater than or equal to the short-time energy of the voice signal of the initial silent section.
Then, when it is determined that the short-time energies of the traversed current frame and the subsequent consecutive frames of the preset number are both greater than or equal to the short-time energy of the initial unvoiced segment speech signal, a ratio of the short-time energies between the traversed current frame and the next frame may be calculated, and when it is determined that the calculated ratio is greater than or equal to a preset threshold, the traversed current frame is determined to be a speech start frame of the speech signal.
For example, let the time-domain speech signal be x(l), and let the n-th frame obtained after framing, windowing and other preprocessing be x_n(m); its short-time energy E_n is:

E_n = Σ_{m=0}^{N-1} x_n^2(m)    (1)

where N denotes the frame length of each frame.
In the first stage, five consecutive frames are examined and it is judged whether their short-time energies satisfy the following condition:

E_i ≥ T_i(IS), i ∈ {n, n+1, n+2, n+3, n+4}    (2)

where IS is the average duration of the initial silence period of the speech, and T_i(IS) represents the short-time energy level of the initial-silence-segment speech signal, i.e., of the background noise.
This first-stage detection avoids the influence of glitch interference on endpoint detection.
When the first-stage check passes, that is, when the short-time energies of the five consecutive frames are all determined to be greater than or equal to the short-time energy of the initial silence segment, the second-stage ratio judgment is entered, namely:

E_{n+1} / E_n ≥ σ_n

where σ_n is the second-stage detection threshold, i.e., the preset threshold: the ratio of the short-time energy of the next frame to that of the current frame in two adjacent frames is used to judge the start segment of the speech. The n-th frame satisfying the above condition is taken as the speech start frame.
Detecting the speech endpoints with this two-stage method can improve the accuracy of speech start-frame detection and thereby the accuracy of speech emotion recognition.
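A minimal sketch of this two-stage start-frame search is given below, assuming the frames produced by the preprocessing sketch above; the five-frame window follows the description, while the function and argument names (noise_energy for the initial-silence energy level, sigma_n for the second-stage threshold) are illustrative.

```python
import numpy as np

def find_speech_start(frames, noise_energy, sigma_n, n_consecutive=5):
    """Two-stage detection of the speech start frame."""
    energy = np.sum(frames ** 2, axis=1)          # short-time energy of each frame
    for n in range(len(energy) - n_consecutive):
        # Stage 1: the current frame and the following frames must all reach the
        # energy level of the initial silence segment (background noise)
        if np.all(energy[n:n + n_consecutive] >= noise_energy):
            # Stage 2: ratio of next-frame energy to current-frame energy
            if energy[n + 1] >= sigma_n * energy[n]:
                return n                          # index of the speech start frame
    return None                                   # no start frame found
```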
Step S103: extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and the derived parameters thereof include short-time energy, a maximum value of the short-time energy, a minimum value of the short-time energy, a mean value of the short-time energy, a variance of the short-time energy, a jitter of the short-time energy, a linear regression coefficient of the short-time energy, a mean square error of the linear regression coefficient of the short-time energy, and a proportion of the short-time energy below 250Hz to the total short-time energy of the plurality of frames obtained after the preprocessing.
Because the energy of a speech signal is strongly correlated with the expression of emotion, the short-time energy of each preprocessed frame is first calculated with formula (1), and the maximum, minimum, mean and variance of the short-time energy over these frames are obtained.
Then the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the short-time energy below 250 Hz to the total short-time energy are calculated over the preprocessed frames, where:
in an embodiment of the present invention, the short-time energy jitter of the preprocessed speech signals of a plurality of frames is calculated by using the following formula:
[The formula for the short-time energy jitter E_s is provided as an image in the original; M denotes the total number of frames.]
In an embodiment of the present invention, the linear regression coefficient of the short-time energy of the preprocessed speech signals of the plurality of frames is calculated by using the following formula:
[The formula for the linear regression coefficient E_r of the short-time energy is provided as an image in the original; M denotes the total number of frames.]
In an embodiment of the present invention, the mean square error of the linear regression coefficient of the short-time energy of the preprocessed speech signals of the plurality of frames is calculated by using the following formula:
[The formulas for the mean square error E_q of the linear regression coefficient of the short-time energy are provided as images in the original.]
In an embodiment of the present invention, the proportion of the short-time energy below 250 Hz to the total short-time energy of the preprocessed frames is:

E_250 / Σ_{i=1}^{M} E_i

where E_250 denotes the sum of the short-time energy below 250 Hz in the frequency domain and E_i the short-time energy of the i-th frame.
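The energy-derived statistics listed above can be sketched as follows; because the jitter, regression and mean-square-error formulas are given only as images in the original, the common textbook definitions are assumed here, and the 8 kHz sampling rate is likewise an assumption.

```python
import numpy as np

def energy_features(frames, fs=8000, low_cut=250.0):
    """Short-time energy statistics sketched from the description above."""
    E = np.sum(frames ** 2, axis=1)               # short-time energy per frame
    M = len(E)
    i = np.arange(1, M + 1)

    feats = {
        "E_max": E.max(), "E_min": E.min(),
        "E_mean": E.mean(), "E_var": E.var(),
        # Relative frame-to-frame jitter of the short-time energy (assumed definition)
        "E_jitter": np.mean(np.abs(np.diff(E))) / (E.mean() + 1e-12),
        # Slope of the least-squares line fitted to E over the frame index
        "E_slope": np.polyfit(i, E, 1)[0],
    }
    # Mean square error of that linear fit
    fit = np.polyval(np.polyfit(i, E, 1), i)
    feats["E_fit_mse"] = np.mean((E - fit) ** 2)

    # Proportion of energy below 250 Hz (per-frame FFT, summed over frames)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    feats["E_low_ratio"] = spec[:, freqs < low_cut].sum() / (spec.sum() + 1e-12)
    return feats
```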
In an embodiment of the present invention, the pitch frequency and its derived parameters of the preprocessed speech signal include the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0. Wherein:
the frequency of the vocal cord vibrations is referred to as the pitch frequency. In one embodiment of the invention, a short-time autocorrelation function is used to obtain the pitch frequency. Specifically, an autocorrelation coefficient R corresponding to the preprocessed speech signal of each frame is first definedn(k) And then by detecting Rn(k) And extracting a corresponding pitch period value from the position of the peak value, and then obtaining the corresponding pitch frequency by reciprocal of the obtained pitch period value sphere.
In an embodiment of the present invention, for the preprocessed n-th frame speech signal x_n(m), its autocorrelation function R_n(k) is:

R_n(k) = Σ_{m=0}^{N-1-k} x_n(m)·x_n(m+k)

where R_n(k) is an even function of k and is non-zero for k in the range (-N+1) to (N-1).
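A minimal sketch of pitch estimation by short-time autocorrelation peak picking is shown below; the 50-500 Hz search range and the 0.3 voicing threshold are illustrative assumptions not given in the description.

```python
import numpy as np

def pitch_by_autocorr(frame, fs=8000, f_min=50.0, f_max=500.0):
    """Estimate the pitch frequency of one frame from its autocorrelation peak.

    Returns 0.0 for frames treated as unvoiced (peak too weak).
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # R_n(k), k >= 0
    k_min, k_max = int(fs / f_max), int(fs / f_min)
    if r[0] <= 0 or k_max >= len(r):
        return 0.0
    k_peak = k_min + np.argmax(r[k_min:k_max])
    if r[k_peak] / r[0] < 0.3:         # weak periodicity -> treat as unvoiced
        return 0.0
    return fs / k_peak                  # pitch frequency = 1 / pitch period
```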
When the pitch frequency of the speech signal of the plurality of frames after the preprocessing is obtained, the maximum value of the pitch frequency, the minimum value of the pitch frequency, the mean value of the pitch frequency, and the variance of the pitch frequency among the pitch frequencies of the speech signal of the plurality of frames after the preprocessing can be obtained.
In one embodiment of the present invention, let the pitch frequency of the i-th voiced frame among the processed frames be denoted F0_i, the total number of voiced frames be denoted M*, and the total number of frames be denoted M; the corresponding first-order and second-order pitch frequency jitter are then as follows:
[The formulas for the first-order pitch frequency jitter F0_s1 and the second-order pitch frequency jitter F0_s2 are provided as images in the original.]
In an embodiment of the present invention, for all pairs of adjacent frames among the preprocessed frames satisfying F(i) × F(i+1) ≠ 0, the corresponding inter-voiced differential pitch dF is:
dF(k)=F(i)-F(i+1),1≤k≤M*,1≤i≤M (11)
where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
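The pitch-derived statistics can then be gathered as in the following sketch; the first- and second-order jitter are computed with commonly used relative definitions, since the patent's exact formulas are provided only as images, and the handling of unvoiced frames as zeros follows the sketch above.

```python
import numpy as np

def pitch_features(f0):
    """Pitch statistics; f0 is the per-frame pitch frequency (0 for unvoiced frames)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]
    if voiced.size < 3:
        return {}
    feats = {
        "F0_max": voiced.max(), "F0_min": voiced.min(),
        "F0_mean": voiced.mean(), "F0_var": voiced.var(),
        # First- and second-order pitch jitter (assumed relative definitions)
        "F0_jitter1": np.mean(np.abs(np.diff(voiced))) / voiced.mean(),
        "F0_jitter2": np.mean(np.abs(np.diff(voiced, n=2))) / voiced.mean(),
    }
    # Differential pitch dF between adjacent frames that are both voiced,
    # i.e. frames with F(i) * F(i+1) != 0
    both_voiced = (f0[:-1] > 0) & (f0[1:] > 0)
    feats["dF"] = f0[:-1][both_voiced] - f0[1:][both_voiced]
    return feats
```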
In an embodiment of the present invention, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, over the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
In a specific implementation, the formant parameters include the formant bandwidth and frequency; they are extracted by estimating the spectral envelope of the speech signal with the linear prediction (LPC) method and then estimating the formant parameters from the vocal tract model.
In an embodiment of the present invention, the LPC method is used to deconvolve the speech signal, and the parameters of the global model of the vocal tract response are obtained as follows:
H(z) = 1 / A(z) = 1 / (1 - Σ_{i=1}^{p} a_i·z^(-i))    (12)
Then, if z_i = r_i·e^(jθ_i) is a root of the prediction error filter A(z), the formant frequency F_i corresponding to this root is:

F_i = θ_i / (2πT)    (13)
where T is the sampling period.
Let the first, second and third formant frequencies of the i-th voiced frame be denoted F1_i, F2_i and F3_i, respectively; the second formant frequency ratio is then:

F2_i / (F2_i - F1_i)    (14)
when the first, second, and third formant frequencies corresponding to the voice signals of the plurality of frames are calculated by using the above formula, the maximum value, the minimum value, the mean value, the variance, and the first-order jitter of the first, second, and third formant frequencies corresponding to the voice signals of the plurality of frames, and the maximum value, the minimum value, and the mean value of the ratio of the second formant frequencies can be obtained.
In one embodiment of the invention, in order to improve recognition accuracy, the nonlinear Hz-Mel relationship is corrected in the extraction of the Mel frequency cepstrum coefficients, and two new coefficient sets, Mid-MFCC and I-MFCC, are introduced. The Mid-MFCC and the I-MFCC have good calculation precision in the middle-frequency and high-frequency regions respectively, and serve as a supplement to the low-order MFCC so that the spectral characteristics of the full frequency domain can be computed. The filters of the Mid-MFCC filter bank are densely distributed in the middle-frequency region and sparse in the low- and high-frequency regions; the I-MFCC is an inverse Mel frequency cepstrum coefficient whose filter bank is sparsely distributed in the low-frequency region and densely distributed in the high-frequency region.
In an embodiment of the invention, the 20 Mel cepstrum coefficients of the improved MFCC comprise the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC. In an embodiment of the present invention, these are calculated from the following frequency mappings:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
Finally, the characteristic parameters of the improved MFCC consist of these 20 Mel cepstrum coefficients together with the maximum, minimum, mean and variance of the first-order difference of the MFCC.
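The improved-MFCC pipeline implied above (a warped-scale triangular filter bank followed by a log and a DCT) can be sketched generically as follows; only the standard Mel warp is written out, and the Mid-Mel and I-Mel warps, whose formulas appear only as images in the original, would be supplied as alternative warp/warp_inv pairs. The filter count of 24 is an illustrative assumption.

```python
import numpy as np
from scipy.fft import dct

def mel_warp(f):
    """Standard Hz -> Mel mapping."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_warp_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_cepstra(power_spec, fs, n_filters=24, n_ceps=12,
                   warp=mel_warp, warp_inv=mel_warp_inv):
    """Cepstral coefficients on an arbitrary warped frequency scale.

    With the standard Mel warp this yields ordinary MFCCs; the Mid-MFCC and
    I-MFCC would plug in their own warp/warp_inv pairs.
    power_spec: array of shape (n_frames, n_fft // 2 + 1).
    """
    n_fft = (power_spec.shape[-1] - 1) * 2
    # Filter centre frequencies equally spaced on the warped scale
    pts = warp_inv(np.linspace(warp(0.0), warp(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)

    fbank = np.zeros((n_filters, power_spec.shape[-1]))
    for j in range(1, n_filters + 1):             # triangular filters
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(power_spec @ fbank.T + 1e-12)
    # Drop the 0th coefficient and keep n_ceps cepstral coefficients
    return dct(log_energy, type=2, axis=-1, norm="ortho")[..., 1:n_ceps + 1]
```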
Step S104: and forming a corresponding feature vector sequence by adopting the extracted feature parameters to obtain the feature vector sequence corresponding to the voice signal.
In a specific implementation, when extracting the feature parameters of the plurality of frames after the preprocessing, combining the extracted feature parameters into a corresponding feature vector sequence in sequence, thereby obtaining a feature vector sequence corresponding to the speech signal.
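A trivial sketch of this assembly step, assuming the scalar statistics from the sketches above have been collected into dictionaries, is:

```python
import numpy as np

def build_feature_vector(*stat_dicts):
    """Concatenate utterance-level scalar statistics into a single feature vector.

    stat_dicts are dictionaries of scalar features such as those produced by the
    sketches above; sorting the keys keeps the ordering consistent across utterances.
    """
    return np.array([float(d[k]) for d in stat_dicts for k in sorted(d)])
```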
Step S105: and training and recognizing the characteristic vector sequence corresponding to the voice signal by adopting a Support Vector Machine (SVM) to obtain a corresponding voice emotion recognition result.
In specific implementation, when the feature vector sequence corresponding to the speech signal is obtained, the feature vector sequence corresponding to the speech signal may be trained and recognized by using a Support Vector Machine (SVM), so as to obtain a corresponding speech emotion recognition result.
In an embodiment of the present invention, a support vector machine kernel function is selected as a Radial Basis Function (RBF), and an adopted support vector machine classifier is a 5-class support vector machine classifier in a "one-vs-one" mode.
Specifically, in the process of training the support vector machine, five emotions are recognized, and 10 support vector machine classifiers are constructed according to the "one-vs-one" strategy, namely the "angry-fear", "angry-sad", "angry-neutral", "angry-happy", "fear-sad", "fear-neutral", "fear-happy", "sad-neutral", "sad-happy" and "neutral-happy" classifiers.
Next, the number of training set samples for each emotion is set to 150, and the number of test set samples is set to 50, and the feature vector sequence composed of the feature parameters extracted in the above-described steps is input to 10 support vector machine classifiers obtained by training.
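The training and recognition step can be sketched with scikit-learn, whose SVC classifier uses the one-vs-one strategy for multiclass problems and supports the RBF kernel; the data loading and the feature-scaling step are assumptions, not part of the description.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train, y_train, X_test, y_test are assumed to be prepared elsewhere:
# feature vectors built as above, labels in {"angry", "fear", "sad", "neutral", "happy"},
# with 150 training and 50 test utterances per emotion as in the description.
def train_and_evaluate(X_train, y_train, X_test, y_test):
    clf = make_pipeline(
        StandardScaler(),
        SVC(kernel="rbf", decision_function_shape="ovo"),  # RBF kernel, one-vs-one
    )
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)   # overall recognition accuracy on the test set
```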
The experimental comparison results of the emotion recognition accuracy obtained by the speech emotion recognition method in the embodiment of the invention and the speech emotion recognition method in the prior art are respectively shown in the following tables 1 and 2:
TABLE 1
[The table is provided as an image in the original and is not reproduced here.]
TABLE 2
[The table is provided as an image in the original and is not reproduced here.]
Comparison of the above tables shows that the recognition accuracy of the speech emotion recognition method in the embodiment of the invention is significantly improved.
The speech emotion recognition method in the embodiment of the present invention is described in detail above, and a device corresponding to the method will be described below.
FIG. 2 shows a structure of a speech emotion recognition apparatus in an embodiment of the present invention. Referring to fig. 2, the apparatus 20 may include an obtaining unit 201, a preprocessing unit 202, a parameter extracting unit 203, and a recognizing unit 204, wherein:
the obtaining unit 201 is adapted to obtain a voice signal to be processed.
The preprocessing unit 202 is adapted to preprocess the acquired voice signal to obtain a preprocessed voice signal.
A parameter extraction unit 203, adapted to extract characteristic parameters of the preprocessed speech signal and to form the extracted characteristic parameters into a corresponding feature vector sequence, thereby obtaining the feature vector sequence corresponding to the speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC.
The recognition unit 204 is adapted to train and recognize the feature vector sequence corresponding to the speech signal by using a support vector machine, so as to obtain a corresponding speech emotion recognition result.
In a specific implementation, the preprocessing unit 202 is adapted to sample and quantize the acquired speech signal, pre-emphasis, frame windowing, short-time energy analysis, and endpoint detection.
In a specific implementation, the preprocessing unit 202 is adapted to traverse a plurality of frames obtained after preprocessing, and obtain a traversed current frame; calculating the short-time energy of the traversed current frame and the continuous frames with the preset number behind the traversed current frame; when the short-time energy of the traversed current frame and the continuous preset number of frames behind the traversed current frame is determined to be more than or equal to the short-time energy of the initial silent section voice signal, calculating the ratio of the short-time energy between the traversed current frame and the next frame; and when the calculated ratio is determined to be greater than or equal to a preset threshold value, determining the traversed current frame as the voice starting frame of the voice signal.
In an embodiment of the present invention, the short-time energy and its derived parameters of the preprocessed speech signal include short-time energy, a maximum value of the short-time energy, a minimum value of the short-time energy, a mean value of the short-time energy, a variance of the short-time energy, short-time energy jitter, a linear regression coefficient of the short-time energy, a mean square error of the linear regression coefficient of the short-time energy, and a proportion of the short-time energy below 250Hz to the total short-time energy of the plurality of frames obtained after the preprocessing;
In an embodiment of the present invention, the pitch frequency and its derived parameters of the preprocessed speech signal include the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0; where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
In an embodiment of the present invention, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, over the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
In an embodiment of the invention, the 20 Mel cepstrum coefficients of the improved MFCC comprise the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC.
In an embodiment of the present invention, the parameter extraction unit 203 is adapted to calculate and obtain 1-6 order MFCCs, 3-10 order Mid-MFCCs, and 7-12 order I-MFCCs by using the following formulas:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
The embodiment of the invention also provides a computer readable storage medium, which stores computer instructions, and the computer instructions execute the steps of the speech emotion recognition method when running. For the speech emotion recognition method, please refer to the introduction of the previous section, and the description is omitted.
The embodiment of the invention also provides a terminal which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of being operated on the processor, and the processor executes the steps of the speech emotion recognition method when the processor operates the computer instructions. For the speech emotion recognition method, please refer to the introduction of the previous section, and the description is omitted.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic disks, optical disks, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a voice signal to be processed;
preprocessing the acquired voice signal to obtain a preprocessed voice signal; the preprocessing comprises sampling and quantizing the acquired voice signals, pre-emphasis, framing and windowing, short-time energy analysis and end point detection; when the voice signal is subjected to endpoint detection, determining a voice starting frame by adopting the following mode: traversing a plurality of frames obtained after preprocessing to obtain traversed current frames; calculating the short-time energy of the traversed current frame and the continuous frames with the preset number behind the traversed current frame; when the short-time energy of the traversed current frame and the continuous preset number of frames behind the traversed current frame is determined to be more than or equal to the short-time energy of the initial silent section voice signal, calculating the ratio of the short-time energy between the traversed current frame and the next frame; when the calculated ratio is determined to be greater than or equal to a preset threshold value, determining the traversed current frame as a voice initial frame of the voice signal;
extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel frequency cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC; wherein the nonlinear Hz-Mel relationship is corrected in the extraction of the Mel frequency cepstrum coefficients, and two new coefficient sets, Mid-MFCC and I-MFCC, are introduced; the Mid-MFCC and the I-MFCC have good calculation precision in the medium-frequency and high-frequency regions respectively, and serve as a supplement to the low-order MFCC so that the spectral characteristics of the full frequency domain are covered;
forming a corresponding feature vector sequence by using the extracted feature parameters to obtain a feature vector sequence corresponding to the voice signal;
and training and identifying the characteristic vector sequence corresponding to the voice signal by adopting a support vector machine to obtain a corresponding voice emotion identification result.
2. The method according to claim 1, wherein the short-term energy and its derived parameters of the preprocessed speech signal comprise short-term energy of a plurality of frames obtained after the preprocessing, maximum value of the short-term energy, minimum value of the short-term energy, mean value of the short-term energy, variance of the short-term energy, jitter of the short-term energy, linear regression coefficient of the short-term energy, mean square error of the linear regression coefficient of the short-term energy, and proportion of the short-term energy below 250Hz to the total short-term energy.
3. The method according to claim 1, wherein the pitch frequency and its derived parameters of the preprocessed speech signal comprise the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0; where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
4. The speech emotion recognition method of claim 1, wherein the voice-quality formant features of the preprocessed speech signal and their derived parameters comprise, over the voiced frames of the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
5. The method for speech emotion recognition of claim 1, wherein the Mel cepstral coefficients of order 20 for MFCC comprise MFCC of order 1-6, Mid-MFCC of order 3-10 and I-MFCC of order 7-12.
6. The speech emotion recognition method of claim 5, wherein the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC are calculated by the following formulas:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
7. A computer readable storage medium having stored thereon computer instructions, wherein the computer instructions when executed perform the steps of the speech emotion recognition method according to any of claims 1 to 6.
8. A terminal, characterized in that it comprises a memory and a processor, said memory storing computer instructions capable of running on said processor, said processor executing said computer instructions to perform the steps of the speech emotion recognition method according to any of claims 1 to 6.
CN201810455163.7A 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal Active CN108305639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810455163.7A CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810455163.7A CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Publications (2)

Publication Number Publication Date
CN108305639A CN108305639A (en) 2018-07-20
CN108305639B true CN108305639B (en) 2021-03-09

Family

ID=62846586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810455163.7A Active CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Country Status (1)

Country Link
CN (1) CN108305639B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243492A (en) * 2018-10-28 2019-01-18 国家计算机网络与信息安全管理中心 A kind of speech emotion recognition system and recognition methods
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN109946055B (en) * 2019-03-22 2021-01-12 宁波慧声智创科技有限公司 Method and system for detecting abnormal sound of automobile seat slide rail
CN110491417A (en) * 2019-08-09 2019-11-22 北京影谱科技股份有限公司 Speech-emotion recognition method and device based on deep learning
CN111243627B (en) * 2020-01-13 2022-09-27 云知声智能科技股份有限公司 Voice emotion recognition method and device
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578470A (en) * 2012-08-09 2014-02-12 安徽科大讯飞信息科技股份有限公司 Telephone recording data processing method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8731932B2 (en) * 2010-08-06 2014-05-20 At&T Intellectual Property I, L.P. System and method for synthetic voice generation and modification
CN106887241A (en) * 2016-10-12 2017-06-23 阿里巴巴集团控股有限公司 A kind of voice signal detection method and device
CN106448659B (en) * 2016-12-19 2019-09-27 广东工业大学 A kind of sound end detecting method based on short-time energy and fractal dimension
CN108010515B (en) * 2017-11-21 2020-06-30 清华大学 Voice endpoint detection and awakening method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578470A (en) * 2012-08-09 2014-02-12 安徽科大讯飞信息科技股份有限公司 Telephone recording data processing method and system

Also Published As

Publication number Publication date
CN108305639A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Tan et al. rVAD: An unsupervised segment-based robust voice activity detection method
CN106935248B (en) Voice similarity detection method and device
Mak et al. A study of voice activity detection techniques for NIST speaker recognition evaluations
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
Zão et al. Speech enhancement with EMD and hurst-based mode selection
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
US10410623B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
CN108682432B (en) Speech emotion recognition device
Bezoui et al. Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC)
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
Archana et al. Gender identification and performance analysis of speech signals
Eringis et al. Improving speech recognition rate through analysis parameters
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
Labied et al. An overview of automatic speech recognition preprocessing techniques
CN112151066A (en) Voice feature recognition-based language conflict monitoring method, medium and equipment
Alam et al. Robust feature extractors for continuous speech recognition
Ijitona et al. Improved silence-unvoiced-voiced (SUV) segmentation for dysarthric speech signals using linear prediction error variance
Vlaj et al. Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria
Vachhani et al. Use of PLP cepstral features for phonetic segmentation
Shome et al. Non-negative frequency-weighted energy-based speech quality estimation for different modes and quality of speech
Chit et al. Myanmar continuous speech recognition system using fuzzy logic classification in speech segmentation
CN110634473A (en) Voice digital recognition method based on MFCC
JP4576612B2 (en) Speech recognition method and speech recognition apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant