CN108305639B - Speech emotion recognition method, computer-readable storage medium and terminal - Google Patents

Speech emotion recognition method, computer-readable storage medium and terminal

Info

Publication number
CN108305639B
CN108305639B (application CN201810455163.7A)
Authority
CN
China
Prior art keywords
mfcc
short
frequency
order
parameters
Prior art date
Legal status
Active
Application number
CN201810455163.7A
Other languages
Chinese (zh)
Other versions
CN108305639A (en)
Inventor
邓立新
王思羽
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810455163.7A
Publication of CN108305639A
Application granted
Publication of CN108305639B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

A speech emotion recognition method and apparatus, a computer-readable storage medium and a terminal are provided, and the method comprises the following steps: acquiring a speech signal to be processed; preprocessing the acquired speech signal to obtain a preprocessed speech signal; extracting characteristic parameters of the preprocessed speech signal, the characteristic parameters comprising short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC; forming the extracted characteristic parameters into a corresponding feature vector sequence to obtain the feature vector sequence corresponding to the speech signal; and training and recognizing the feature vector sequence corresponding to the speech signal with a support vector machine to obtain a corresponding speech emotion recognition result. The scheme can improve the accuracy of speech emotion recognition.

Description

Speech emotion recognition method, computer-readable storage medium and terminal
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice emotion recognition method, a computer readable storage medium and a terminal.
Background
With the rapid development of information technology and the continuous enhancement of human dependence on computers, the capability of human-computer interaction is more and more emphasized by researchers. In fact, the problems to be solved in human-computer interaction are consistent with important factors in human-to-human communication, and the most important factor is the capacity of 'emotional intelligence'.
At present, research on emotion information processing is being intensively conducted, and research on emotion information carried by speech signals is receiving more and more attention. Speech emotion recognition refers to a technology that processes and recognizes speech signals with signal processing techniques and pattern recognition methods in order to determine which emotion the speech expresses.
However, the conventional speech emotion recognition method has a problem of low recognition accuracy.
Disclosure of Invention
The invention solves the technical problem of how to improve the accuracy of speech emotion recognition.
In order to solve the above technical problem, an embodiment of the present invention provides a speech emotion recognition method, where the method includes:
acquiring a voice signal to be processed;
preprocessing the acquired voice signal to obtain a preprocessed voice signal;
extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC;
forming a corresponding feature vector sequence by using the extracted feature parameters to obtain a feature vector sequence corresponding to the voice signal;
and training and identifying the characteristic vector sequence corresponding to the voice signal by adopting a support vector machine to obtain a corresponding voice emotion identification result.
Optionally, the preprocessing the acquired voice signal includes:
the acquired speech signal is sampled and quantized, pre-emphasized, frame-windowed, short-time energy analyzed and end-point detected.
Optionally, when performing endpoint detection on the speech signal, determining a speech start frame in the following manner:
traversing a plurality of frames obtained after preprocessing to obtain traversed current frames;
calculating the short-time energy of the traversed current frame and the continuous frames with the preset number behind the traversed current frame;
when the short-time energy of the traversed current frame and the continuous preset number of frames behind the traversed current frame is determined to be more than or equal to the short-time energy of the initial silent section voice signal, calculating the ratio of the short-time energy between the traversed current frame and the next frame;
and when the calculated ratio is determined to be greater than or equal to a preset threshold value, determining the traversed current frame as the voice starting frame of the voice signal.
Optionally, the short-time energy of the preprocessed speech signal and its derived parameters include short-time energy, a maximum value of the short-time energy, a minimum value of the short-time energy, a mean value of the short-time energy, a variance of the short-time energy, short-time energy jitter, a linear regression coefficient of the short-time energy, a mean square error of the linear regression coefficient of the short-time energy, and a proportion of the short-time energy below 250Hz to the total short-time energy of the plurality of frames obtained after the preprocessing.
Optionally, the pitch frequency and its derived parameters of the preprocessed speech signal include the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0; where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
Optionally, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, over the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
Optionally, the 20 Mel cepstrum coefficients of the improved MFCC comprise the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC.
Optionally, the 1-6 order MFCC, the 3-10 order Mid-MFCC, and the 7-12 order I-MFCC are calculated by the following formulas:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
The embodiment of the invention also provides a computer-readable storage medium, on which computer instructions are stored, and when the computer instructions are executed, the steps of the speech emotion recognition method are executed.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of being operated on the processor, and the processor executes the steps of the speech emotion recognition method when executing the computer instructions.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
according to the scheme, the characteristic parameters including the short-time energy and the derived parameters thereof, the fundamental tone frequency and the derived parameters thereof, the tone characteristic formant and the derived parameters thereof and the 20-order Mel cepstrum coefficient obtained by MFCC are extracted from the preprocessed voice signals, the extracted characteristic parameters of the 20-order Mel cepstrum coefficient obtained by MFCC cover the full frequency domain, and compared with the MFCC parameters only covering the low frequency, the recognition precision of the middle and high frequency domains can be improved, so that the accuracy of voice emotion recognition can be improved, and the use experience of a user can be improved.
Further, when performing endpoint detection on the speech signal, the short-time energy of the traversed current frame and of a preset number of consecutive frames after it is first calculated. When these short-time energies are all determined to be greater than or equal to the short-time energy of the initial silence segment, the ratio of the short-time energy between the traversed current frame and the next frame is calculated, and when the calculated ratio is determined to be greater than or equal to a preset threshold, the traversed current frame is determined to be the speech start frame of the speech signal. Because it is first checked whether the short-time energies of the preset number of consecutive frames including the current frame are all greater than or equal to that of the initial silence segment, the influence of glitch interference on endpoint detection can be reduced or even avoided, so the accuracy of endpoint detection can be improved.
Drawings
FIG. 1 is a flow chart of a speech emotion recognition method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech emotion recognition apparatus in an embodiment of the present invention.
Detailed Description
According to the technical scheme of the embodiments of the invention, characteristic parameters including the short-time energy and its derived parameters, the pitch frequency and its derived parameters, the voice-quality formant features and their derived parameters, and the 20 Mel cepstrum coefficients of the improved MFCC are extracted from the preprocessed speech signal. The extracted 20 Mel cepstrum coefficients cover the full frequency domain; compared with MFCC parameters that cover only the low frequencies, the recognition precision in the middle and high frequency ranges can be improved, so the accuracy of speech emotion recognition can be improved and the user experience enhanced.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
FIG. 1 is a flowchart illustrating a speech emotion recognition method according to an embodiment of the present invention. Referring to fig. 1, a speech emotion recognition method, the method comprising:
step S101: and acquiring a voice signal to be processed.
In a specific implementation, the voice signal to be processed is a digital signal obtained by performing analog-to-digital conversion on a corresponding analog signal.
Step S102: and preprocessing the acquired voice signal to obtain a preprocessed voice signal.
In a specific implementation, when preprocessing the acquired speech signal, the acquired speech signal is first sampled and quantized, pre-emphasized, frame-wise windowed, short-time energy analyzed, and end-point detected.
In one embodiment of the present invention, a first-order digital filter H(z) = 1 - μ·z^(-1) with μ = 0.98 is used for pre-emphasis of the speech signal; the frame length adopted during framing is 320 samples and the frame shift is 80 samples; windowing is performed with a Hamming window.
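As an illustration of the preprocessing described above (pre-emphasis with μ = 0.98, frame length 320 samples, frame shift 80 samples, Hamming window), a minimal Python/NumPy sketch is given below; the function name and the assumption that the signal has already been sampled and quantized are illustrative, not part of the patent text.

```python
import numpy as np

def preprocess(signal, mu=0.98, frame_len=320, frame_shift=80):
    """Pre-emphasize the signal, then split it into overlapping Hamming-windowed frames."""
    # Pre-emphasis: y[n] = x[n] - mu * x[n-1], i.e. H(z) = 1 - mu * z^(-1)
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Framing (assumes the signal is at least one frame long), then windowing
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift:i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```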
In an embodiment of the present invention, in order to improve the accuracy of endpoint detection, a two-stage detection method is used to detect the speech endpoints. Specifically, the method comprises the following steps:
firstly, the voice signals of a plurality of frames obtained after preprocessing can be traversed according to the sequence, the short-time energy of the voice signals of the traversed current frame and the subsequent continuous frames with the preset number is calculated, and the short-time energy of the traversed current frame and the subsequent continuous frames with the preset number is respectively compared with the short-time energy of the voice signal of the initial silent section, so as to determine whether the short-time energy of the traversed current frame and the subsequent continuous frames with the preset number is greater than or equal to the short-time energy of the voice signal of the initial silent section.
Then, when it is determined that the short-time energies of the traversed current frame and the subsequent consecutive frames of the preset number are both greater than or equal to the short-time energy of the initial unvoiced segment speech signal, a ratio of the short-time energies between the traversed current frame and the next frame may be calculated, and when it is determined that the calculated ratio is greater than or equal to a preset threshold, the traversed current frame is determined to be a speech start frame of the speech signal.
For example, let the time-domain speech signal be x(l), and let the n-th frame obtained after framing, windowing and other preprocessing be x_n(m); its short-time energy E_n is:

E_n = Σ_{m=0}^{N-1} x_n^2(m)    (1)

where N denotes the frame length of each frame.
In the first stage, five consecutive frames are examined and it is judged whether their short-time energies satisfy the following condition:

E_i ≥ T_i(IS), i ∈ {n, n+1, n+2, n+3, n+4}    (2)

where IS is the average duration of the initial silence period of the speech, and T_i(IS) represents the short-time energy level of the initial-silence-segment speech signal, i.e., of the background noise.
This first-stage detection avoids the influence of glitch interference on endpoint detection.
When the first-stage check passes, that is, when the short-time energies of the five consecutive frames are all determined to be greater than or equal to the short-time energy of the initial silence segment, the second-stage ratio judgment is entered, namely:

E_{n+1} / E_n ≥ σ_n

where σ_n is the second-stage detection threshold, i.e., the preset threshold: the ratio of the short-time energy of the next frame to that of the current frame in two adjacent frames is used to judge the start segment of the speech. The n-th frame satisfying the above condition is taken as the speech start frame.
Detecting the speech endpoints with this two-stage method can improve the accuracy of speech start-frame detection and thereby the accuracy of speech emotion recognition.
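A minimal sketch of this two-stage start-frame search is given below, assuming the frames produced by the preprocessing sketch above; the five-frame window follows the description, while the function and argument names (noise_energy for the initial-silence energy level, sigma_n for the second-stage threshold) are illustrative.

```python
import numpy as np

def find_speech_start(frames, noise_energy, sigma_n, n_consecutive=5):
    """Two-stage detection of the speech start frame."""
    energy = np.sum(frames ** 2, axis=1)          # short-time energy of each frame
    for n in range(len(energy) - n_consecutive):
        # Stage 1: the current frame and the following frames must all reach the
        # energy level of the initial silence segment (background noise)
        if np.all(energy[n:n + n_consecutive] >= noise_energy):
            # Stage 2: ratio of next-frame energy to current-frame energy
            if energy[n + 1] >= sigma_n * energy[n]:
                return n                          # index of the speech start frame
    return None                                   # no start frame found
```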
Step S103: extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and the derived parameters thereof include short-time energy, a maximum value of the short-time energy, a minimum value of the short-time energy, a mean value of the short-time energy, a variance of the short-time energy, a jitter of the short-time energy, a linear regression coefficient of the short-time energy, a mean square error of the linear regression coefficient of the short-time energy, and a proportion of the short-time energy below 250Hz to the total short-time energy of the plurality of frames obtained after the preprocessing.
Because the energy of a speech signal is strongly correlated with the expression of emotion, the short-time energy of each preprocessed frame is first calculated with formula (1), and the maximum, minimum, mean and variance of the short-time energy over these frames are obtained.
Then the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the short-time energy below 250 Hz to the total short-time energy are calculated over the preprocessed frames, where:
in an embodiment of the present invention, the short-time energy jitter of the preprocessed speech signals of a plurality of frames is calculated by using the following formula:
[The formula for the short-time energy jitter E_s is provided as an image in the original; M denotes the total number of frames.]
In an embodiment of the present invention, the linear regression coefficient of the short-time energy of the preprocessed speech signals of the plurality of frames is calculated by using the following formula:
[The formula for the linear regression coefficient E_r of the short-time energy is provided as an image in the original; M denotes the total number of frames.]
In an embodiment of the present invention, the mean square error of the linear regression coefficient of the short-time energy of the preprocessed speech signals of the plurality of frames is calculated by using the following formula:
[The formulas for the mean square error E_q of the linear regression coefficient of the short-time energy are provided as images in the original.]
In an embodiment of the present invention, the proportion of the short-time energy below 250 Hz to the total short-time energy of the preprocessed frames is:

E_250 / Σ_{i=1}^{M} E_i

where E_250 denotes the sum of the short-time energy below 250 Hz in the frequency domain and E_i the short-time energy of the i-th frame.
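The energy-derived statistics listed above can be sketched as follows; because the jitter, regression and mean-square-error formulas are given only as images in the original, the common textbook definitions are assumed here, and the 8 kHz sampling rate is likewise an assumption.

```python
import numpy as np

def energy_features(frames, fs=8000, low_cut=250.0):
    """Short-time energy statistics sketched from the description above."""
    E = np.sum(frames ** 2, axis=1)               # short-time energy per frame
    M = len(E)
    i = np.arange(1, M + 1)

    feats = {
        "E_max": E.max(), "E_min": E.min(),
        "E_mean": E.mean(), "E_var": E.var(),
        # Relative frame-to-frame jitter of the short-time energy (assumed definition)
        "E_jitter": np.mean(np.abs(np.diff(E))) / (E.mean() + 1e-12),
        # Slope of the least-squares line fitted to E over the frame index
        "E_slope": np.polyfit(i, E, 1)[0],
    }
    # Mean square error of that linear fit
    fit = np.polyval(np.polyfit(i, E, 1), i)
    feats["E_fit_mse"] = np.mean((E - fit) ** 2)

    # Proportion of energy below 250 Hz (per-frame FFT, summed over frames)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    feats["E_low_ratio"] = spec[:, freqs < low_cut].sum() / (spec.sum() + 1e-12)
    return feats
```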
In an embodiment of the present invention, the pitch frequency and its derived parameters of the preprocessed speech signal include the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0. Wherein:
the frequency of the vocal cord vibrations is referred to as the pitch frequency. In one embodiment of the invention, a short-time autocorrelation function is used to obtain the pitch frequency. Specifically, an autocorrelation coefficient R corresponding to the preprocessed speech signal of each frame is first definedn(k) And then by detecting Rn(k) And extracting a corresponding pitch period value from the position of the peak value, and then obtaining the corresponding pitch frequency by reciprocal of the obtained pitch period value sphere.
In an embodiment of the present invention, for the preprocessed n-th frame speech signal x_n(m), its autocorrelation function R_n(k) is:

R_n(k) = Σ_{m=0}^{N-1-k} x_n(m)·x_n(m+k)

where R_n(k) is an even function of k and is non-zero for k in the range (-N+1) to (N-1).
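A minimal sketch of pitch estimation by short-time autocorrelation peak picking is shown below; the 50-500 Hz search range and the 0.3 voicing threshold are illustrative assumptions not given in the description.

```python
import numpy as np

def pitch_by_autocorr(frame, fs=8000, f_min=50.0, f_max=500.0):
    """Estimate the pitch frequency of one frame from its autocorrelation peak.

    Returns 0.0 for frames treated as unvoiced (peak too weak).
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # R_n(k), k >= 0
    k_min, k_max = int(fs / f_max), int(fs / f_min)
    if r[0] <= 0 or k_max >= len(r):
        return 0.0
    k_peak = k_min + np.argmax(r[k_min:k_max])
    if r[k_peak] / r[0] < 0.3:         # weak periodicity -> treat as unvoiced
        return 0.0
    return fs / k_peak                  # pitch frequency = 1 / pitch period
```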
When the pitch frequency of the speech signal of the plurality of frames after the preprocessing is obtained, the maximum value of the pitch frequency, the minimum value of the pitch frequency, the mean value of the pitch frequency, and the variance of the pitch frequency among the pitch frequencies of the speech signal of the plurality of frames after the preprocessing can be obtained.
In one embodiment of the present invention, let the pitch frequency of the i-th voiced frame among the processed frames be denoted F0_i, the total number of voiced frames be denoted M*, and the total number of frames be denoted M; the corresponding first-order and second-order pitch frequency jitter are then as follows:
[The formulas for the first-order pitch frequency jitter F0_s1 and the second-order pitch frequency jitter F0_s2 are provided as images in the original.]
In an embodiment of the present invention, for all pairs of adjacent frames among the preprocessed frames satisfying F(i) × F(i+1) ≠ 0, the corresponding inter-voiced differential pitch dF is:
dF(k)=F(i)-F(i+1),1≤k≤M*,1≤i≤M (11)
where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
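The pitch-derived statistics can then be gathered as in the following sketch; the first- and second-order jitter are computed with commonly used relative definitions, since the patent's exact formulas are provided only as images, and the handling of unvoiced frames as zeros follows the sketch above.

```python
import numpy as np

def pitch_features(f0):
    """Pitch statistics; f0 is the per-frame pitch frequency (0 for unvoiced frames)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]
    if voiced.size < 3:
        return {}
    feats = {
        "F0_max": voiced.max(), "F0_min": voiced.min(),
        "F0_mean": voiced.mean(), "F0_var": voiced.var(),
        # First- and second-order pitch jitter (assumed relative definitions)
        "F0_jitter1": np.mean(np.abs(np.diff(voiced))) / voiced.mean(),
        "F0_jitter2": np.mean(np.abs(np.diff(voiced, n=2))) / voiced.mean(),
    }
    # Differential pitch dF between adjacent frames that are both voiced,
    # i.e. frames with F(i) * F(i+1) != 0
    both_voiced = (f0[:-1] > 0) & (f0[1:] > 0)
    feats["dF"] = f0[:-1][both_voiced] - f0[1:][both_voiced]
    return feats
```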
In an embodiment of the present invention, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, over the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
In a specific implementation, the formant parameters include the formant bandwidth and frequency; they are extracted by estimating the spectral envelope of the speech signal with the linear prediction (LPC) method and then estimating the formant parameters from the vocal tract model.
In an embodiment of the present invention, the LPC method is used to deconvolve the speech signal, and the parameters of the global model of the vocal tract response are obtained as follows:
H(z) = 1 / A(z) = 1 / (1 - Σ_{i=1}^{p} a_i·z^(-i))    (12)
Then, if z_i = r_i·e^(jθ_i) is a root of the prediction error filter A(z), the formant frequency F_i corresponding to this root is:

F_i = θ_i / (2πT)    (13)
where T is the sampling period.
Let the first, second and third formant frequencies of the i-th voiced frame be denoted F1_i, F2_i and F3_i, respectively; the second formant frequency ratio is then:

F2_i / (F2_i - F1_i)    (14)
when the first, second, and third formant frequencies corresponding to the voice signals of the plurality of frames are calculated by using the above formula, the maximum value, the minimum value, the mean value, the variance, and the first-order jitter of the first, second, and third formant frequencies corresponding to the voice signals of the plurality of frames, and the maximum value, the minimum value, and the mean value of the ratio of the second formant frequencies can be obtained.
In one embodiment of the invention, in order to improve recognition accuracy, the nonlinear Hz-Mel relationship is corrected in the extraction of the Mel frequency cepstrum coefficients, and two new coefficient sets, Mid-MFCC and I-MFCC, are introduced. The Mid-MFCC and the I-MFCC have good calculation precision in the middle-frequency and high-frequency regions respectively, and serve as a supplement to the low-order MFCC so that the spectral characteristics of the full frequency domain can be computed. The filters of the Mid-MFCC filter bank are densely distributed in the middle-frequency region and sparse in the low- and high-frequency regions; the I-MFCC is an inverse Mel frequency cepstrum coefficient whose filter bank is sparsely distributed in the low-frequency region and densely distributed in the high-frequency region.
In an embodiment of the invention, the 20 Mel cepstrum coefficients of the improved MFCC comprise the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC. In an embodiment of the present invention, these are calculated from the following frequency mappings:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
Finally, the characteristic parameters of the improved MFCC consist of these 20 Mel cepstrum coefficients together with the maximum, minimum, mean and variance of the first-order difference of the MFCC.
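The improved-MFCC pipeline implied above (a warped-scale triangular filter bank followed by a log and a DCT) can be sketched generically as follows; only the standard Mel warp is written out, and the Mid-Mel and I-Mel warps, whose formulas appear only as images in the original, would be supplied as alternative warp/warp_inv pairs. The filter count of 24 is an illustrative assumption.

```python
import numpy as np
from scipy.fft import dct

def mel_warp(f):
    """Standard Hz -> Mel mapping."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_warp_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_cepstra(power_spec, fs, n_filters=24, n_ceps=12,
                   warp=mel_warp, warp_inv=mel_warp_inv):
    """Cepstral coefficients on an arbitrary warped frequency scale.

    With the standard Mel warp this yields ordinary MFCCs; the Mid-MFCC and
    I-MFCC would plug in their own warp/warp_inv pairs.
    power_spec: array of shape (n_frames, n_fft // 2 + 1).
    """
    n_fft = (power_spec.shape[-1] - 1) * 2
    # Filter centre frequencies equally spaced on the warped scale
    pts = warp_inv(np.linspace(warp(0.0), warp(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)

    fbank = np.zeros((n_filters, power_spec.shape[-1]))
    for j in range(1, n_filters + 1):             # triangular filters
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(power_spec @ fbank.T + 1e-12)
    # Drop the 0th coefficient and keep n_ceps cepstral coefficients
    return dct(log_energy, type=2, axis=-1, norm="ortho")[..., 1:n_ceps + 1]
```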
Step S104: and forming a corresponding feature vector sequence by adopting the extracted feature parameters to obtain the feature vector sequence corresponding to the voice signal.
In a specific implementation, when extracting the feature parameters of the plurality of frames after the preprocessing, combining the extracted feature parameters into a corresponding feature vector sequence in sequence, thereby obtaining a feature vector sequence corresponding to the speech signal.
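A trivial sketch of this assembly step, assuming the scalar statistics from the sketches above have been collected into dictionaries, is:

```python
import numpy as np

def build_feature_vector(*stat_dicts):
    """Concatenate utterance-level scalar statistics into a single feature vector.

    stat_dicts are dictionaries of scalar features such as those produced by the
    sketches above; sorting the keys keeps the ordering consistent across utterances.
    """
    return np.array([float(d[k]) for d in stat_dicts for k in sorted(d)])
```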
Step S105: and training and recognizing the characteristic vector sequence corresponding to the voice signal by adopting a Support Vector Machine (SVM) to obtain a corresponding voice emotion recognition result.
In specific implementation, when the feature vector sequence corresponding to the speech signal is obtained, the feature vector sequence corresponding to the speech signal may be trained and recognized by using a Support Vector Machine (SVM), so as to obtain a corresponding speech emotion recognition result.
In an embodiment of the present invention, a support vector machine kernel function is selected as a Radial Basis Function (RBF), and an adopted support vector machine classifier is a 5-class support vector machine classifier in a "one-vs-one" mode.
Specifically, in the process of training the support vector machine, five emotions are recognized, and 10 support vector machine classifiers are constructed according to the "one-vs-one" strategy, namely the "angry-fear", "angry-sad", "angry-neutral", "angry-happy", "fear-sad", "fear-neutral", "fear-happy", "sad-neutral", "sad-happy" and "neutral-happy" classifiers.
Next, the number of training set samples for each emotion is set to 150, and the number of test set samples is set to 50, and the feature vector sequence composed of the feature parameters extracted in the above-described steps is input to 10 support vector machine classifiers obtained by training.
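The training and recognition step can be sketched with scikit-learn, whose SVC classifier uses the one-vs-one strategy for multiclass problems and supports the RBF kernel; the data loading and the feature-scaling step are assumptions, not part of the description.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train, y_train, X_test, y_test are assumed to be prepared elsewhere:
# feature vectors built as above, labels in {"angry", "fear", "sad", "neutral", "happy"},
# with 150 training and 50 test utterances per emotion as in the description.
def train_and_evaluate(X_train, y_train, X_test, y_test):
    clf = make_pipeline(
        StandardScaler(),
        SVC(kernel="rbf", decision_function_shape="ovo"),  # RBF kernel, one-vs-one
    )
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)   # overall recognition accuracy on the test set
```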
The experimental comparison results of the emotion recognition accuracy obtained by the speech emotion recognition method in the embodiment of the invention and the speech emotion recognition method in the prior art are respectively shown in the following tables 1 and 2:
TABLE 1
[The table is provided as an image in the original and is not reproduced here.]
TABLE 2
[The table is provided as an image in the original and is not reproduced here.]
Comparison of the above tables shows that the recognition accuracy of the speech emotion recognition method in the embodiment of the invention is significantly improved.
The speech emotion recognition method in the embodiment of the present invention is described in detail above, and a device corresponding to the method will be described below.
FIG. 2 shows a structure of a speech emotion recognition apparatus in an embodiment of the present invention. Referring to fig. 2, the apparatus 20 may include an obtaining unit 201, a preprocessing unit 202, a parameter extracting unit 203, and a recognizing unit 204, wherein:
the obtaining unit 201 is adapted to obtain a voice signal to be processed.
The preprocessing unit 202 is adapted to preprocess the acquired voice signal to obtain a preprocessed voice signal.
A parameter extraction unit 203, adapted to extract characteristic parameters of the preprocessed speech signal and to form the extracted characteristic parameters into a corresponding feature vector sequence, thereby obtaining the feature vector sequence corresponding to the speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC.
The recognition unit 204 is adapted to train and recognize the feature vector sequence corresponding to the speech signal by using a support vector machine, so as to obtain a corresponding speech emotion recognition result.
In a specific implementation, the preprocessing unit 202 is adapted to sample and quantize the acquired speech signal, pre-emphasis, frame windowing, short-time energy analysis, and endpoint detection.
In a specific implementation, the preprocessing unit 202 is adapted to traverse a plurality of frames obtained after preprocessing, and obtain a traversed current frame; calculating the short-time energy of the traversed current frame and the continuous frames with the preset number behind the traversed current frame; when the short-time energy of the traversed current frame and the continuous preset number of frames behind the traversed current frame is determined to be more than or equal to the short-time energy of the initial silent section voice signal, calculating the ratio of the short-time energy between the traversed current frame and the next frame; and when the calculated ratio is determined to be greater than or equal to a preset threshold value, determining the traversed current frame as the voice starting frame of the voice signal.
In an embodiment of the present invention, the short-time energy and its derived parameters of the preprocessed speech signal include short-time energy, a maximum value of the short-time energy, a minimum value of the short-time energy, a mean value of the short-time energy, a variance of the short-time energy, short-time energy jitter, a linear regression coefficient of the short-time energy, a mean square error of the linear regression coefficient of the short-time energy, and a proportion of the short-time energy below 250Hz to the total short-time energy of the plurality of frames obtained after the preprocessing;
In an embodiment of the present invention, the pitch frequency and its derived parameters of the preprocessed speech signal include the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0; where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
In an embodiment of the present invention, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, over the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
In an embodiment of the invention, the 20 Mel cepstrum coefficients of the improved MFCC comprise the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC.
In an embodiment of the present invention, the parameter extraction unit 203 is adapted to calculate and obtain 1-6 order MFCCs, 3-10 order Mid-MFCCs, and 7-12 order I-MFCCs by using the following formulas:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
The embodiment of the invention also provides a computer readable storage medium, which stores computer instructions, and the computer instructions execute the steps of the speech emotion recognition method when running. For the speech emotion recognition method, please refer to the introduction of the previous section, and the description is omitted.
The embodiment of the invention also provides a terminal which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of being operated on the processor, and the processor executes the steps of the speech emotion recognition method when the processor operates the computer instructions. For the speech emotion recognition method, please refer to the introduction of the previous section, and the description is omitted.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic disks, optical disks, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a voice signal to be processed;
preprocessing the acquired voice signal to obtain a preprocessed voice signal; the preprocessing comprises sampling and quantizing the acquired voice signals, pre-emphasis, framing and windowing, short-time energy analysis and end point detection; when the voice signal is subjected to endpoint detection, determining a voice starting frame by adopting the following mode: traversing a plurality of frames obtained after preprocessing to obtain traversed current frames; calculating the short-time energy of the traversed current frame and the continuous frames with the preset number behind the traversed current frame; when the short-time energy of the traversed current frame and the continuous preset number of frames behind the traversed current frame is determined to be more than or equal to the short-time energy of the initial silent section voice signal, calculating the ratio of the short-time energy between the traversed current frame and the next frame; when the calculated ratio is determined to be greater than or equal to a preset threshold value, determining the traversed current frame as a voice initial frame of the voice signal;
extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters comprise short-time energy and its derived parameters, pitch frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 Mel frequency cepstrum coefficients of the improved MFCC, and the maximum, minimum, mean and variance of the first-order difference of the MFCC; wherein the nonlinear Hz-Mel relationship is corrected in the extraction of the Mel frequency cepstrum coefficients, and two new coefficient sets, Mid-MFCC and I-MFCC, are introduced; the Mid-MFCC and the I-MFCC have good calculation precision in the medium-frequency and high-frequency regions respectively, and serve as a supplement to the low-order MFCC so that the spectral characteristics of the full frequency domain are covered;
forming a corresponding feature vector sequence by using the extracted feature parameters to obtain a feature vector sequence corresponding to the voice signal;
and training and identifying the characteristic vector sequence corresponding to the voice signal by adopting a support vector machine to obtain a corresponding voice emotion identification result.
2. The method according to claim 1, wherein the short-term energy and its derived parameters of the preprocessed speech signal comprise short-term energy of a plurality of frames obtained after the preprocessing, maximum value of the short-term energy, minimum value of the short-term energy, mean value of the short-term energy, variance of the short-term energy, jitter of the short-term energy, linear regression coefficient of the short-term energy, mean square error of the linear regression coefficient of the short-term energy, and proportion of the short-term energy below 250Hz to the total short-term energy.
3. The method according to claim 1, wherein the pitch frequency and its derived parameters of the preprocessed speech signal comprise the pitch frequency, maximum pitch frequency, minimum pitch frequency, mean pitch frequency, variance of the pitch frequency, first-order pitch frequency jitter and second-order pitch frequency jitter of the plurality of frames obtained after the preprocessing, and the inter-voiced differential pitch corresponding to two adjacent frames satisfying F(i) × F(i+1) ≠ 0; where F(i) represents the pitch frequency of the i-th frame, and F(i+1) represents the pitch frequency of the (i+1)-th frame.
4. The speech emotion recognition method of claim 1, wherein the voice-quality formant features of the preprocessed speech signal and their derived parameters comprise, over the voiced frames of the plurality of frames obtained after the preprocessing, the maximum, minimum, mean, variance and first-order jitter of the first, second and third formant frequencies, as well as the maximum, minimum and mean of the second formant frequency ratio.
5. The method for speech emotion recognition of claim 1, wherein the Mel cepstral coefficients of order 20 for MFCC comprise MFCC of order 1-6, Mid-MFCC of order 3-10 and I-MFCC of order 7-12.
6. The speech emotion recognition method of claim 5, wherein the 1st-6th order MFCC, the 3rd-10th order Mid-MFCC and the 7th-12th order I-MFCC are calculated by the following formulas:
f_Mel = 2595 · lg(1 + f / 700)    (the standard Hz-Mel mapping)
[The corresponding Mid-Mel and inverse-Mel (I-Mel) frequency mappings are provided as images in the original and are not reproduced here.]
wherein f_Mel represents the frequency of the MFCC, f_Mid-Mel represents the frequency of the Mid-MFCC, f_I-Mel represents the frequency of the I-MFCC, and f represents the actual frequency.
7. A computer readable storage medium having stored thereon computer instructions, wherein the computer instructions when executed perform the steps of the speech emotion recognition method according to any of claims 1 to 6.
8. A terminal, characterized in that it comprises a memory and a processor, said memory storing computer instructions capable of running on said processor, said processor executing said computer instructions to perform the steps of the speech emotion recognition method according to any of claims 1 to 6.
CN201810455163.7A 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal Active CN108305639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810455163.7A CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810455163.7A CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Publications (2)

Publication Number Publication Date
CN108305639A CN108305639A (en) 2018-07-20
CN108305639B true CN108305639B (en) 2021-03-09

Family

ID=62846586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810455163.7A Active CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Country Status (1)

Country Link
CN (1) CN108305639B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243492A (en) * 2018-10-28 2019-01-18 国家计算机网络与信息安全管理中心 A kind of speech emotion recognition system and recognition methods
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN109946055B (en) * 2019-03-22 2021-01-12 宁波慧声智创科技有限公司 Method and system for detecting abnormal sound of automobile seat slide rail
CN110491417A (en) * 2019-08-09 2019-11-22 北京影谱科技股份有限公司 Speech-emotion recognition method and device based on deep learning
CN111243627B (en) * 2020-01-13 2022-09-27 云知声智能科技股份有限公司 Voice emotion recognition method and device
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578470A (en) * 2012-08-09 2014-02-12 安徽科大讯飞信息科技股份有限公司 Telephone recording data processing method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8731932B2 (en) * 2010-08-06 2014-05-20 At&T Intellectual Property I, L.P. System and method for synthetic voice generation and modification
CN106887241A (en) * 2016-10-12 2017-06-23 阿里巴巴集团控股有限公司 A kind of voice signal detection method and device
CN106448659B (en) * 2016-12-19 2019-09-27 广东工业大学 A kind of sound end detecting method based on short-time energy and fractal dimension
CN108010515B (en) * 2017-11-21 2020-06-30 清华大学 Voice endpoint detection and awakening method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578470A (en) * 2012-08-09 2014-02-12 安徽科大讯飞信息科技股份有限公司 Telephone recording data processing method and system

Also Published As

Publication number Publication date
CN108305639A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Tan et al. rVAD: An unsupervised segment-based robust voice activity detection method
CN106935248B (en) Voice similarity detection method and device
Mak et al. A study of voice activity detection techniques for NIST speaker recognition evaluations
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
Zão et al. Speech enhancement with EMD and hurst-based mode selection
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
US10410623B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
CN108682432B (en) Speech emotion recognition device
Bezoui et al. Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC)
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
Archana et al. Gender identification and performance analysis of speech signals
Eringis et al. Improving speech recognition rate through analysis parameters
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
Labied et al. An overview of automatic speech recognition preprocessing techniques
CN112151066A (en) Voice feature recognition-based language conflict monitoring method, medium and equipment
Alam et al. Robust feature extractors for continuous speech recognition
Ijitona et al. Improved silence-unvoiced-voiced (SUV) segmentation for dysarthric speech signals using linear prediction error variance
Vlaj et al. Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria
Vachhani et al. Use of PLP cepstral features for phonetic segmentation
Shome et al. Non-negative frequency-weighted energy-based speech quality estimation for different modes and quality of speech
Chit et al. Myanmar continuous speech recognition system using fuzzy logic classification in speech segmentation
CN110634473A (en) Voice digital recognition method based on MFCC
JP4576612B2 (en) Speech recognition method and speech recognition apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant