CN108305639A - Speech-emotion recognition method, computer readable storage medium, terminal - Google Patents
Speech-emotion recognition method, computer readable storage medium, terminal
- Publication number
- CN108305639A (application CN201810455163.7A)
- Authority
- CN
- China
- Prior art keywords
- short
- mfcc
- voice signal
- time energy
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Auxiliary Devices For Music (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
A speech emotion recognition method and device, a computer-readable storage medium, and a terminal. The method includes: obtaining a speech signal to be processed; preprocessing the obtained speech signal to obtain a preprocessed speech signal; extracting characteristic parameters of the preprocessed speech signal, the characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC; composing the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal; and training and classifying the feature vector sequence corresponding to the speech signal with a support vector machine, obtaining the corresponding speech emotion recognition result. The scheme can improve the accuracy of speech emotion recognition.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a speech emotion recognition method, a computer-readable storage medium, and a terminal.
Background
With the rapid development of information technology, humanity's dependence on computers keeps growing, and the ability of human-computer interaction receives ever more attention from researchers. The problem to be solved is to make human-computer interaction as consistent as exchanges between people, and the most critical requirement is the ability of "emotional intelligence".
At present, as research on affective information processing deepens, research on affective information processing in speech signals is increasingly valued. Speech emotion recognition refers to the technology of processing and recognizing speech signals by means of signal processing and pattern recognition methods, so as to judge which kind of emotion the speech belongs to.
However, existing speech emotion recognition methods suffer from low recognition accuracy.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of speech emotion recognition.
To solve the above technical problem, an embodiment of the present invention provides a speech emotion recognition method, the method including:
Obtaining a speech signal to be processed;
Preprocessing the obtained speech signal to obtain a preprocessed speech signal;
Extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC;
Composing the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal;
Training and classifying the feature vector sequence corresponding to the speech signal using a support vector machine, obtaining the corresponding speech emotion recognition result.
Optionally, preprocessing the obtained speech signal includes:
Sampling and quantizing the obtained speech signal, followed by pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
Optionally, when performing endpoint detection on the speech signal, the speech start frame is determined as follows:
Traversing the frames obtained after preprocessing to obtain the current frame;
Calculating the short-time energy of the current frame and of a preset number of consecutive frames following it;
When the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, calculating the ratio of the short-time energy between the current frame and the next frame;
When the calculated ratio is greater than or equal to a preset threshold, determining the current frame to be the speech start frame of the speech signal.
Optionally, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
Optionally, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each frame obtained after preprocessing, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0; wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
Optionally, the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
Optionally, the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
Optionally, the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 are respectively calculated using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
An embodiment of the present invention further provides a computer-readable storage medium on which computer instructions are stored, wherein the computer instructions, when run, execute the steps of any one of the above speech emotion recognition methods.
An embodiment of the present invention further provides a terminal including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, executes the steps of any one of the above speech emotion recognition methods.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
In the above scheme, characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, and 20 orders of Mel cepstral coefficients derived from the MFCC are extracted from the preprocessed speech signal. The extracted 20 orders of Mel cepstral coefficients cover the full frequency domain; compared with MFCC parameters covering only the low frequencies, they improve the recognition precision in the middle and high frequency bands, and can therefore improve the accuracy of speech emotion recognition and the user experience.
Further, when performing endpoint detection on the speech signal, the short-time energy of the current frame and of a preset number of consecutive frames following it is calculated first; when these short-time energies are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, the ratio of the short-time energy between the current frame and the next frame is calculated; and when the calculated ratio is greater than or equal to a preset threshold, the current frame is determined to be the speech start frame of the speech signal. Because the judgment of whether the short-time energy reaches that of the initial silent segment is first made over the preset number of consecutive frames including the current frame, the influence of spike ("burr") interference on endpoint detection can be reduced or even avoided, and the accuracy of endpoint detection can therefore be improved.
Description of the drawings
Fig. 1 is a flow diagram of a speech emotion recognition method in an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of a speech emotion recognition device in an embodiment of the present invention.
Detailed description
In the technical solution of the embodiments of the present invention, characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, and 20 orders of Mel cepstral coefficients derived from the MFCC are extracted from the preprocessed speech signal. The extracted 20 orders of Mel cepstral coefficients cover the full frequency domain; compared with MFCC parameters covering only the low frequencies, they improve the recognition precision in the middle and high frequency bands, thereby improving the accuracy of speech emotion recognition and the user experience.
To make the above objects, features, and beneficial effects of the present invention more comprehensible, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a speech emotion recognition method according to an embodiment of the present invention. Referring to Fig. 1, the speech emotion recognition method includes:
Step S101: Obtain the speech signal to be processed.
In a specific implementation, the speech signal to be processed is a digital signal obtained by performing analog-to-digital conversion on the corresponding analog signal.
Step S102: Preprocess the obtained speech signal to obtain a preprocessed speech signal.
In a specific implementation, preprocessing the obtained speech signal comprises first sampling and quantizing it, then applying pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
In an embodiment of the present invention, pre-emphasis is performed with the first-order digital filter H(z) = 1 − μz⁻¹, with μ = 0.98; the frame length used for framing is 320 samples with a frame shift of 80; and a Hamming window is used for windowing.
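A minimal sketch of this preprocessing step, assuming NumPy and the parameter values stated above (μ = 0.98, frame length 320, frame shift 80, Hamming window); the function name and return layout are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def preprocess(signal, mu=0.98, frame_len=320, frame_shift=80):
    """Pre-emphasis, framing, and windowing of a sampled speech signal.

    Assumes `signal` is a 1-D float array at least `frame_len` samples long.
    """
    # First-order pre-emphasis filter H(z) = 1 - mu * z^-1
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Split into overlapping frames and apply a Hamming window to each
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```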
In an embodiment of the present invention, in order to improve the accuracy of endpoint detection, a two-stage detection method is used to detect the speech endpoints. Specifically:
First, the frames of the speech signal obtained after preprocessing may be traversed in order; the short-time energies of the current frame and of a preset number of consecutive frames following it are calculated, and each of these short-time energies is compared with the short-time energy of the initial silent segment of the speech signal, so as to determine whether they are all greater than or equal to the short-time energy of the initial silent segment.
Then, when the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment, the ratio of the short-time energy between the current frame and the next frame can be calculated; and when the calculated ratio is greater than or equal to a preset threshold, the current frame is determined to be the speech start frame of the speech signal.
For example, taking the time-domain expression of the speech signal as x(l), and the n-th frame of speech obtained after preprocessing steps such as framing and windowing as x_n(m), its short-time energy E_n is:
E_n = Σ_{m=0}^{N−1} x_n²(m)   (1)
wherein N denotes the frame length of each frame.
In the first detection stage, the speech signal of five consecutive frames is examined to judge whether the short-time energies of those five frames all satisfy:
E_i ≥ T_i(IS), i ∈ {m, m+1, m+2, m+3, m+4}   (2)
wherein IS is the average duration of the initial silent segment of the speech, and T_i(IS) denotes the short-time energy of the speech signal of the initial silent segment, i.e. the background noise.
Through this first detection stage, the influence of spike interference on endpoint detection can be avoided.
When the first stage is passed, that is, when the short-time energies of the five consecutive frames are determined to be all greater than or equal to the short-time energy of the initial silent segment, the second-stage ratio decision is entered:
E_{n+1} / E_n ≥ σ_n   (3)
wherein σ_n is the second-stage detection threshold, i.e. the preset threshold against which the ratio between the short-time energy of the latter frame and that of the former frame in two adjacent frames is judged, in order to determine the beginning of speech. The n-th frame of speech data satisfying the above formula is the start frame of the speech.
Using the two-stage detection method to detect the speech endpoint can improve the accuracy of speech start frame detection, and thereby the accuracy of speech emotion recognition.
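The two-stage start-frame decision described above can be sketched as follows; `noise_energy` stands for T_i(IS) (the short-time energy of the initial silent segment) and `ratio_threshold` for the second-stage threshold σ_n, both of whose concrete values the patent leaves as presets:

```python
import numpy as np

def short_time_energy(frames):
    # E_n = sum of x_n(m)^2 over each frame, per formula (1)
    return np.sum(frames ** 2, axis=1)

def find_speech_start(frames, noise_energy, ratio_threshold, n_check=5):
    """Two-stage endpoint detection sketch; returns the index of the
    speech start frame, or None if no start frame is found."""
    energy = short_time_energy(frames)
    for m in range(len(energy) - n_check + 1):
        # Stage 1: the current frame and the following frames (five in
        # total here, per formula (2)) must all reach the background-noise
        # energy level, which suppresses isolated spikes.
        if np.all(energy[m : m + n_check] >= noise_energy):
            # Stage 2: the energy ratio between the next frame and the
            # current frame must reach the preset threshold, formula (3).
            if energy[m + 1] / energy[m] >= ratio_threshold:
                return m
    return None
```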
Step S103: Extract the characteristic parameters of the preprocessed speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
Because the energy of a speech signal is strongly correlated with the expression of emotion, formula (1) is first used to calculate the short-time energy of each preprocessed frame, and the maximum, minimum, mean, and variance of the short-time energy over the preprocessed frames are obtained.
Then, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient, and the proportion of the total short-time energy below 250 Hz are calculated over the preprocessed frames. Wherein:
In an embodiment of the present invention, the short-time energy jitter over the preprocessed frames is calculated with the following formula:
E_s = [ (1/(M−1)) Σ_{i=1}^{M−1} |E_i − E_{i+1}| ] / [ (1/M) Σ_{i=1}^{M} E_i ]   (4)
wherein E_s denotes the short-time energy jitter and M denotes the total number of frames.
In an embodiment of the present invention, the linear regression coefficient of the short-time energy over the preprocessed frames is calculated with the following formula:
E_r = [ Σ_{i=1}^{M} i·E_i − (1/M) Σ_{i=1}^{M} i · Σ_{i=1}^{M} E_i ] / [ Σ_{i=1}^{M} i² − (1/M) (Σ_{i=1}^{M} i)² ]   (5)
wherein E_r denotes the linear regression coefficient of the short-time energy (the least-squares slope of the energy against the frame index) and M denotes the total number of frames.
In an embodiment of the present invention, the mean square error of the linear regression of the short-time energy over the preprocessed frames is calculated with the following formula:
E_q = (1/M) Σ_{i=1}^{M} (E_i − Ê_i)²   (6)
wherein E_q denotes the mean square error of the linear regression of the short-time energy, and Ê_i is the value of E_i predicted by the regression line.
In an embodiment of the present invention, the proportion of the total short-time energy of the preprocessed frames lying below 250 Hz is:
E_250 / Σ_{i=1}^{M} E_i   (7)
wherein E_250 denotes the sum of the short-time energy below 250 Hz in the frequency domain.
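A sketch of the energy statistics listed above, assuming NumPy; how the sub-250 Hz energy is obtained (for example from an FFT band sum per frame) is an assumption, since the text only names the ratio:

```python
import numpy as np

def energy_features(energy, energy_below_250hz):
    """`energy` holds the per-frame short-time energies E_1..E_M (NumPy
    array); `energy_below_250hz` the per-frame energy restricted to
    frequencies below 250 Hz."""
    M = len(energy)
    i = np.arange(1, M + 1)

    # Least-squares line E_i ~ a + E_r * i over the frame index
    E_r, a = np.polyfit(i, energy, 1)             # slope E_r per formula (5)
    E_q = np.mean((energy - (a + E_r * i)) ** 2)  # regression MSE, formula (6)

    return {
        "max": energy.max(), "min": energy.min(),
        "mean": energy.mean(), "var": energy.var(),
        # Jitter per formula (4): normalised frame-to-frame variation
        "jitter": np.mean(np.abs(np.diff(energy))) / energy.mean(),
        "regression_coef": E_r,
        "regression_mse": E_q,
        # Proportion of the total energy below 250 Hz, formula (7)
        "ratio_below_250hz": energy_below_250hz.sum() / energy.sum(),
    }
```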
In an embodiment of the present invention, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each preprocessed frame, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0. Wherein:
The frequency at which the vocal cords vibrate is called the fundamental frequency. In an embodiment of the present invention, the short-time autocorrelation function is used to obtain the fundamental frequency. Specifically, the autocorrelation coefficient R_n(k) of each preprocessed frame is first defined; the corresponding pitch period value is then extracted by detecting the position of the peak of R_n(k); and taking the reciprocal of the obtained pitch period yields the corresponding fundamental frequency.
In an embodiment of the present invention, for the preprocessed n-th frame of speech x_n(m), the autocorrelation function R_n(k) is:
R_n(k) = Σ_{m=0}^{N−1−k} x_n(m) x_n(m+k)   (8)
wherein R_n(k) is an even function whose nonzero values lie in the range k = −(N−1) to (N−1).
When the fundamental frequencies of the preprocessed frames have been obtained, the maximum, minimum, mean, and variance of the fundamental frequency over the preprocessed frames can be computed.
In an embodiment of the present invention, denoting the fundamental frequency of the i-th voiced frame as F0_i, the total number of voiced frames as M*, and the total number of preprocessed frames as M, the first-order and second-order pitch jitters are respectively:
F0_s1 = [ (1/(M*−1)) Σ_{i=1}^{M*−1} |F0_i − F0_{i+1}| ] / [ (1/M*) Σ_{i=1}^{M*} F0_i ]   (9)
F0_s2 = [ (1/(M*−2)) Σ_{i=2}^{M*−1} |2F0_i − F0_{i−1} − F0_{i+1}| ] / [ (1/M*) Σ_{i=1}^{M*} F0_i ]   (10)
wherein F0_s1 denotes the first-order pitch jitter and F0_s2 denotes the second-order pitch jitter.
In an embodiment of the present invention, among all pairs of adjacent frames of the preprocessed speech signal satisfying F(i) × F(i+1) ≠ 0, the differential pitch dF between the corresponding voiced sounds is:
dF(k) = F(i) − F(i+1), 1 ≤ k ≤ M*, 1 ≤ i ≤ M   (11)
wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
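A sketch of the autocorrelation-based pitch extraction and the differential pitch of formula (11); the 50–500 Hz search range is an assumption (the patent does not state one), and unvoiced frames are represented by a fundamental frequency of 0:

```python
import numpy as np

def pitch_from_autocorr(frame, fs, f0_min=50.0, f0_max=500.0):
    """Fundamental frequency of one frame from the peak of R_n(k)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    k_min, k_max = int(fs / f0_max), min(int(fs / f0_min), len(r) - 1)
    if k_min >= k_max or r[0] <= 0:
        return 0.0   # no usable peak: treat the frame as unvoiced
    k_peak = k_min + int(np.argmax(r[k_min:k_max]))
    # The peak lag k_peak is the pitch period in samples; its reciprocal
    # (scaled by the sampling rate) is the fundamental frequency.
    return fs / k_peak

def differential_pitch(f0):
    """dF between adjacent voiced frames, i.e. pairs with F(i)*F(i+1) != 0."""
    f0 = np.asarray(f0, dtype=float)
    voiced_pairs = (f0[:-1] * f0[1:]) != 0
    return (f0[:-1] - f0[1:])[voiced_pairs]
```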
In an embodiment of the present invention, the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
In a specific implementation, the formant parameters include the formant bandwidths and frequencies; their extraction is based on estimating the spectral envelope of the speech signal, and the formant parameters are estimated from the vocal tract model using the linear prediction (LPC) method. In an embodiment of the present invention, the speech signal is deconvolved with the LPC method, giving the all-pole model of the vocal tract response:
H(z) = 1 / A(z) = 1 / (1 − Σ_{k=1}^{p} a_k z⁻ᵏ)   (12)
Then the roots of the prediction-error filter A(z) are found; for a complex root z_i = r_i e^{jθ_i}, the corresponding formant frequency is:
F_i = θ_i / (2πT)   (13)
wherein T is the sampling period.
Denoting the first, second, and third formant frequencies of the i-th voiced frame as F1_i, F2_i, and F3_i, the second formant frequency ratio is:
F2_i / (F2_i − F1_i)   (14)
When the first, second, and third formant frequencies of the frames have been calculated using the above formulas, the maximum, minimum, mean, variance, and first-order jitter of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio, can be computed over those frames.
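The LPC-based formant estimation of formulas (12) and (13) can be sketched as below; the LPC order of 12 and the plain normal-equation solution are illustrative assumptions:

```python
import numpy as np

def lpc_formants(frame, fs, order=12):
    """First three formant frequency candidates of one voiced frame,
    from the roots of the prediction-error filter A(z)."""
    # Autocorrelation method: solve the normal equations for the
    # coefficients a_1..a_p of the all-pole vocal-tract model 1/A(z)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][: order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])

    # Roots of A(z) = 1 - a_1 z^-1 - ... - a_p z^-p; keep one root of
    # each complex-conjugate pair
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]

    # Each root angle theta_i maps to a formant F_i = theta_i / (2*pi*T),
    # per formula (13)
    T = 1.0 / fs
    freqs = np.sort(np.angle(roots) / (2 * np.pi * T))
    return freqs[:3]   # candidates for the first three formants
```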
In an embodiment of the present invention, in order to improve recognition precision, the nonlinear Hz-Mel relation is corrected in the extraction of the Mel-frequency cepstral coefficients, and two new sets of coefficients, Mid-MFCC and I-MFCC, are introduced. Mid-MFCC and I-MFCC have good computational precision in the middle- and high-frequency regions respectively, and can serve as a supplement to the low-order MFCC, enabling the spectral features of the full frequency domain to be computed. The filters of the Mid-MFCC filter bank are distributed more densely in the middle-frequency region and more sparsely in the low- and high-frequency regions; the I-MFCC are inverse Mel-frequency cepstral coefficients, whose filter bank is distributed sparsely in the low-frequency region and more densely in the high-frequency region.
In an embodiment of the present invention, the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12, which are respectively calculated using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
Finally, the characteristic parameters of the improved MFCC are composed of these 20 orders of Mel cepstral parameters together with the maximum, minimum, mean, and variance of the first-order difference of the Mel cepstral coefficients (MFCC).
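Since the patent's exact Mid-MFCC and I-MFCC warping formulas are not reproduced in this text, the sketch below stands in the standard Mel scale and the inverted-Mel scale from the IMFCC literature (with an assumed 4000 Hz upper band edge); the 6 + 8 + 6 coefficient split is taken from the description, everything else is an assumption:

```python
import numpy as np

def f_mel(f):
    # Standard Mel scale used by the ordinary MFCC filter bank
    return 2595.0 * np.log10(1.0 + f / 700.0)

def f_inv_mel(f, f_max=4000.0):
    # Inverted Mel scale: filters dense at high frequency, sparse at low
    return f_mel(f_max) - f_mel(f_max - f)

def improved_mfcc_stats(cepstra):
    """`cepstra`: (n_frames, 20) matrix of the combined coefficients
    (MFCC orders 1-6, Mid-MFCC orders 3-10, I-MFCC orders 7-12).
    Taking the first-order difference across frames, rather than across
    coefficient orders, is an assumption."""
    delta = np.diff(cepstra, axis=0)
    return np.array([delta.max(), delta.min(), delta.mean(), delta.var()])
```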
Step S104: Compose the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal.
In a specific implementation, once the characteristic parameters of the preprocessed frames have been extracted, they are combined in order into the corresponding feature vectors, thereby obtaining the feature vector sequence corresponding to the speech signal.
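Continuing the sketches above, a single utterance's feature vector can then be assembled by concatenating the parameter groups in a fixed order; the argument names refer to the earlier illustrative functions and are assumptions (each argument is a 1-D array of the statistics from the corresponding group):

```python
import numpy as np

def utterance_features(energy_stats, pitch_stats, formant_stats, mfcc_stats):
    # Fixed concatenation order so that every utterance yields a vector
    # with the same layout for the SVM stage
    return np.concatenate([energy_stats, pitch_stats, formant_stats, mfcc_stats])
```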
Step S105: Train and classify the feature vector sequence corresponding to the speech signal using a support vector machine (SVM), obtaining the corresponding speech emotion recognition result.
In a specific implementation, once the feature vector sequence corresponding to the speech signal has been obtained, it can be trained and classified using a support vector machine (SVM) to obtain the corresponding speech emotion recognition result.
In an embodiment of the present invention, the radial basis function (RBF) is chosen as the kernel of the support vector machine, and the support vector machine classifier used is a five-class classifier in "one-vs-one" mode.
Specifically, in the course of training the support vector machine, five kinds of emotion are recognized; according to the "one-vs-one" strategy, 10 support vector machine classifiers can be built, namely the "anger-fear", "anger-sadness", "anger-neutral", "anger-happiness", "fear-sadness", "fear-neutral", "fear-happiness", "sadness-neutral", "sadness-happiness", and "neutral-happiness" classifiers.
Then, 150 training samples and 50 test samples are provided for each emotion, and the feature vector sequences composed of the characteristic parameters extracted in the above steps are input to the 10 trained support vector machine classifiers.
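A minimal sketch of this classification stage with scikit-learn, whose SVC implements multi-class classification by the same one-vs-one scheme (10 pairwise classifiers for 5 classes); the emotion labels and data layout are illustrative assumptions:

```python
from sklearn.svm import SVC

# Assumed label set matching the five emotions named above
EMOTIONS = ["anger", "fear", "sadness", "neutral", "happiness"]

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """X_*: feature matrices (n_samples, n_features); y_*: emotion labels."""
    clf = SVC(kernel="rbf")   # RBF kernel; multi-class handling is one-vs-one
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)   # mean recognition accuracy
```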
Experimental comparisons of the recognition accuracy of the speech emotion recognition method in the embodiment of the present invention against a speech emotion recognition method in the prior art are shown in Tables 1 and 2 below:
Table 1
Table 2
From the comparison of the above tables, it can be seen that the recognition accuracy of the speech emotion recognition method in the embodiment of the present invention is markedly improved.
The speech emotion recognition method in the embodiment of the present invention having been described in detail above, the device corresponding to the above method is introduced below.
Fig. 2 shows the structure of a speech emotion recognition device in an embodiment of the present invention. Referring to Fig. 2, the device 20 may include an acquiring unit 201, a preprocessing unit 202, a parameter extraction unit 203, and a recognition unit 204, wherein:
The acquiring unit 201 is adapted to obtain the speech signal to be processed.
The preprocessing unit 202 is adapted to preprocess the obtained speech signal to obtain a preprocessed speech signal.
The parameter extraction unit 203 is adapted to extract the characteristic parameters of the preprocessed speech signal and to compose the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC.
The recognition unit 204 is adapted to train and classify the feature vector sequence corresponding to the speech signal using a support vector machine, obtaining the corresponding speech emotion recognition result.
In a specific implementation, the preprocessing unit 202 is adapted to sample and quantize the obtained speech signal and to apply pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
In a specific implementation, the preprocessing unit 202 is adapted to traverse the frames obtained after preprocessing, obtaining the current frame; to calculate the short-time energy of the current frame and of a preset number of consecutive frames following it; when the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, to calculate the ratio of the short-time energy between the current frame and the next frame; and when the calculated ratio is greater than or equal to a preset threshold, to determine the current frame to be the speech start frame of the speech signal.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
In an embodiment of the present invention, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each frame obtained after preprocessing, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0; wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
In an embodiment of the present invention, the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
In an embodiment of the present invention, the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
In an embodiment of the present invention, the parameter extraction unit 203 is adapted to calculate the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 respectively using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
An embodiment of the present invention further provides a computer-readable storage medium on which computer instructions are stored, wherein the computer instructions, when run, execute the steps of the above speech emotion recognition method. For the speech emotion recognition method, refer to the introduction in the preceding sections, which is not repeated here.
An embodiment of the present invention further provides a terminal including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, executes the steps of the above speech emotion recognition method. For the speech emotion recognition method, refer to the introduction in the preceding sections, which is not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic disk, optical disc, and the like.
Although the present disclosure is as above, the present invention is not limited thereto. Any person skilled in the art can make various changes or modifications without departing from the spirit and scope of the invention; therefore, the protection scope of the present invention shall be subject to the scope defined by the claims.
Claims (10)
1. A speech emotion recognition method, characterized by including:
Obtaining a speech signal to be processed;
Preprocessing the obtained speech signal to obtain a preprocessed speech signal;
Extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC;
Composing the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal;
Training and classifying the feature vector sequence corresponding to the speech signal using a support vector machine, obtaining the corresponding speech emotion recognition result.
2. The speech emotion recognition method according to claim 1, characterized in that preprocessing the obtained speech signal includes:
Sampling and quantizing the obtained speech signal, followed by pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
3. The speech emotion recognition method according to claim 2, characterized in that, when performing endpoint detection on the speech signal, the speech start frame is determined as follows:
Traversing the frames obtained after preprocessing to obtain the current frame;
Calculating the short-time energy of the current frame and of a preset number of consecutive frames following it;
When the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, calculating the ratio of the short-time energy between the current frame and the next frame;
When the calculated ratio is greater than or equal to a preset threshold, determining the current frame to be the speech start frame of the speech signal.
4. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
5. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each frame obtained after preprocessing, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0; wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
6. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
7. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
8. The speech emotion recognition method according to claim 7, characterized in that the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 are respectively calculated using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
9. A computer-readable storage medium on which computer instructions are stored, characterized in that the computer instructions, when run, execute the steps of the speech emotion recognition method according to any one of claims 1 to 8.
10. A terminal, characterized by including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, executes the steps of the speech emotion recognition method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810455163.7A CN108305639B (en) | 2018-05-11 | 2018-05-11 | Speech emotion recognition method, computer-readable storage medium and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810455163.7A CN108305639B (en) | 2018-05-11 | 2018-05-11 | Speech emotion recognition method, computer-readable storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108305639A (en) | 2018-07-20
CN108305639B CN108305639B (en) | 2021-03-09 |
Family
ID=62846586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810455163.7A Active CN108305639B (en) | 2018-05-11 | 2018-05-11 | Speech emotion recognition method, computer-readable storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108305639B (en) |
- 2018-05-11 CN CN201810455163.7A patent/CN108305639B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9495954B2 (en) * | 2010-08-06 | 2016-11-15 | At&T Intellectual Property I, L.P. | System and method of synthetic voice generation and modification |
CN103578470A (en) * | 2012-08-09 | 2014-02-12 | 安徽科大讯飞信息科技股份有限公司 | Telephone recording data processing method and system |
CN106887241A (en) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of voice signal detection method and device |
CN106448659A (en) * | 2016-12-19 | 2017-02-22 | 广东工业大学 | Speech endpoint detection method based on short-time energy and fractal dimensions |
CN108010515A (en) * | 2017-11-21 | 2018-05-08 | 清华大学 | A kind of speech terminals detection and awakening method and device |
Non-Patent Citations (2)
Title |
---|
ZHAO Li et al.: "Several Key Technologies in Practical Speech Emotion Recognition", Journal of Data Acquisition and Processing *
HAN Yi et al.: "Speech Emotion Recognition Based on MFCC", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243492A (en) * | 2018-10-28 | 2019-01-18 | 国家计算机网络与信息安全管理中心 | A kind of speech emotion recognition system and recognition methods |
CN109300339A (en) * | 2018-11-19 | 2019-02-01 | 王泓懿 | A kind of exercising method and system of Oral English Practice |
CN109946055A (en) * | 2019-03-22 | 2019-06-28 | 武汉源海博创科技有限公司 | A kind of sliding rail of automobile seat abnormal sound detection method and system |
CN109946055B (en) * | 2019-03-22 | 2021-01-12 | 宁波慧声智创科技有限公司 | Method and system for detecting abnormal sound of automobile seat slide rail |
CN110491417A (en) * | 2019-08-09 | 2019-11-22 | 北京影谱科技股份有限公司 | Speech-emotion recognition method and device based on deep learning |
CN111243627A (en) * | 2020-01-13 | 2020-06-05 | 云知声智能科技股份有限公司 | Voice emotion recognition method and device |
CN111243627B (en) * | 2020-01-13 | 2022-09-27 | 云知声智能科技股份有限公司 | Voice emotion recognition method and device |
CN111768800A (en) * | 2020-06-23 | 2020-10-13 | 中兴通讯股份有限公司 | Voice signal processing method, apparatus and storage medium |
CN113807249A (en) * | 2021-09-17 | 2021-12-17 | 广州大学 | Multi-mode feature fusion based emotion recognition method, system, device and medium |
CN113807249B (en) * | 2021-09-17 | 2024-01-12 | 广州大学 | Emotion recognition method, system, device and medium based on multi-mode feature fusion |
Also Published As
Publication number | Publication date |
---|---|
CN108305639B (en) | 2021-03-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108305639A (en) | Speech-emotion recognition method, computer readable storage medium, terminal | |
CN108682432A (en) | Speech emotion recognition device | |
CN107610715B (en) | Similarity calculation method based on multiple sound characteristics | |
Kulmer et al. | Phase estimation in single channel speech enhancement using phase decomposition | |
CN106486131A (en) | A kind of method and device of speech de-noising | |
CN104700843A (en) | Method and device for identifying ages | |
Hui et al. | Convolutional maxout neural networks for speech separation | |
CN109448726A (en) | A kind of method of adjustment and system of voice control accuracy rate | |
Mittal et al. | Study of characteristics of aperiodicity in Noh voices | |
Yuan | A time–frequency smoothing neural network for speech enhancement | |
CN110136709A (en) | Audio recognition method and video conferencing system based on speech recognition | |
Jiao et al. | Convex weighting criteria for speaking rate estimation | |
CN108288465A (en) | Intelligent sound cuts the method for axis, information data processing terminal, computer program | |
Archana et al. | Gender identification and performance analysis of speech signals | |
Morales-Cordovilla et al. | Feature extraction based on pitch-synchronous averaging for robust speech recognition | |
CN108269574A (en) | Voice signal processing method and device, storage medium and electronic equipment | |
Hanilçi et al. | Comparing spectrum estimators in speaker verification under additive noise degradation | |
Sinha et al. | On the use of pitch normalization for improving children's speech recognition | |
Huang et al. | DNN-based speech enhancement using MBE model | |
CN106356076A (en) | Method and device for detecting voice activity on basis of artificial intelligence | |
CN111755029B (en) | Voice processing method, device, storage medium and electronic equipment | |
Lu | Reduction of musical residual noise using block-and-directional-median filter adapted by harmonic properties | |
Jameel et al. | Noise robust formant frequency estimation method based on spectral model of repeated autocorrelation of speech | |
Thirumuru et al. | Application of non-negative frequency-weighted energy operator for vowel region detection | |
Shome et al. | Non-negative frequency-weighted energy-based speech quality estimation for different modes and quality of speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |