CN108305639A - Speech-emotion recognition method, computer readable storage medium, terminal - Google Patents

Speech-emotion recognition method, computer readable storage medium, terminal Download PDF

Info

Publication number
CN108305639A
CN108305639A (application CN201810455163.7A)
Authority
CN
China
Prior art keywords
short
mfcc
voice signal
time energy
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810455163.7A
Other languages
Chinese (zh)
Other versions
CN108305639B (en
Inventor
邓立新
王思羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810455163.7A priority Critical patent/CN108305639B/en
Publication of CN108305639A publication Critical patent/CN108305639A/en
Application granted granted Critical
Publication of CN108305639B publication Critical patent/CN108305639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal


Abstract

A speech emotion recognition method and device, a computer-readable storage medium, and a terminal. The method includes: obtaining a speech signal to be processed; preprocessing the obtained speech signal to obtain a preprocessed speech signal; extracting characteristic parameters from the preprocessed speech signal, the characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 orders of Mel cepstrum coefficients sought for the MFCC, and the maximum value, minimum value, mean, and variance of the first-order difference of the MFCC; combining the extracted characteristic parameters into the corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal; and training and classifying the feature vector sequence corresponding to the speech signal with a support vector machine, obtaining the corresponding speech emotion recognition result. The above scheme can improve the accuracy of speech emotion recognition.

Description

Speech-emotion recognition method, computer readable storage medium, terminal
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition method, a computer-readable storage medium, and a terminal.
Background technology
With the rapid development of information technology and mankind's ever-growing dependence on computers, the ability of human-computer interaction has received increasing attention from researchers. The key problem to be solved in human-computer interaction is keeping it consistent with person-to-person communication, and the most critical capability is "emotional intelligence".
At present, as research on affective information processing deepens, affective information processing in speech signals is attracting more and more attention. Speech emotion recognition refers to the technique of processing and recognizing speech signals by means of signal processing and pattern recognition methods, so as to judge which emotion the speech conveys.
However, existing speech emotion recognition methods suffer from low recognition accuracy.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of speech emotion recognition.
To solve the above technical problem, an embodiment of the present invention provides a speech emotion recognition method, the method including:
obtaining a speech signal to be processed;
preprocessing the obtained speech signal to obtain a preprocessed speech signal;
extracting characteristic parameters from the preprocessed speech signal, the characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 orders of Mel cepstrum coefficients sought for the MFCC, and the maximum value, minimum value, mean, and variance of the first-order difference of the MFCC;
combining the extracted characteristic parameters into the corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal;
training and classifying the feature vector sequence corresponding to the speech signal with a support vector machine, obtaining the corresponding speech emotion recognition result.
Optionally, preprocessing the obtained speech signal includes:
sampling and quantizing the obtained speech signal, followed by pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
Optionally, when performing endpoint detection on the speech signal, the speech start frame is determined as follows:
traversing the multiple frames obtained after preprocessing to obtain the current frame;
calculating the short-time energy of the current frame and of a preset number of consecutive frames after it;
when the short-time energies of the current frame and of the preset number of consecutive frames after it are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, calculating the ratio of short-time energy between the current frame and the next frame;
when the calculated ratio is greater than or equal to a preset threshold, determining that the current frame is the speech start frame of the speech signal.
Optionally, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of the multiple frames obtained after preprocessing, the maximum value, minimum value, mean, and variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of that linear regression coefficient, and the proportion of the short-time energy below 250 Hz in the total short-time energy.
Optionally, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of the multiple frames obtained after preprocessing, the maximum value, minimum value, mean, and variance of the fundamental frequency, the first-order and second-order pitch jitter, and the pitch difference between the voiced parts of adjacent frames satisfying F(i)*F(i+1) != 0, where F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
Optionally, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, for each voiced frame among the multiple frames obtained after preprocessing, the maximum value, minimum value, mean, variance, and first-order jitter of the first, second, and third formant frequencies, and the maximum value, minimum value, and mean of the second formant frequency ratio.
Optionally, the 20 orders of Mel cepstrum coefficients sought for the MFCC include the MFCC of orders 1-6, the Mid-MFCC of orders 3-10, and the I-MFCC of orders 7-12.
Optionally, the MFCC of orders 1-6, the Mid-MFCC of orders 3-10, and the I-MFCC of orders 7-12 are each calculated with their own frequency-warping formula, where f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
An embodiment of the present invention further provides a computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when run, perform the steps of the speech emotion recognition method described in any of the above embodiments.
An embodiment of the present invention further provides a terminal including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, performs the steps of the speech emotion recognition method described in any of the above embodiments.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantageous effects:
In the above scheme, characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formant features and their derived parameters, and the 20 orders of Mel cepstrum coefficients sought for the MFCC are extracted from the preprocessed speech signal. Because the extracted 20 orders of Mel cepstrum coefficients cover the full frequency domain, they can improve recognition precision in the middle and high frequency bands compared with MFCC parameters covering only the low frequencies, thereby improving the accuracy of speech emotion recognition and the user experience.
Further, when performing endpoint detection on the speech signal, the short-time energy of the current frame and of a preset number of consecutive frames after it is calculated first. Only when these short-time energies are all greater than or equal to the short-time energy of the initial silent segment is the ratio of short-time energy between the current frame and the next frame calculated, and only when the calculated ratio is greater than or equal to a preset threshold is the current frame determined to be the speech start frame of the speech signal. Because a run of consecutive frames including the current frame is first required to reach the energy of the initial silent segment, the influence of burr interference on endpoint detection can be reduced or even avoided, so the accuracy of endpoint detection can be improved.
Description of the drawings
Fig. 1 is a flow diagram of a speech emotion recognition method in an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of a speech emotion recognition device in an embodiment of the present invention.
Detailed description of embodiments
In the technical solution of the embodiments of the present invention, characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formant features and their derived parameters, and the 20 orders of Mel cepstrum coefficients sought for the MFCC are extracted from the preprocessed speech signal. Because the extracted Mel cepstrum coefficients cover the full frequency domain, they can improve recognition precision in the middle and high frequency bands compared with MFCC parameters covering only the low frequencies, thereby improving the accuracy of speech emotion recognition and the user experience.
To make the above objects, features, and advantageous effects of the present invention more comprehensible, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a speech emotion recognition method according to an embodiment of the present invention. Referring to Fig. 1, the speech emotion recognition method includes:
Step S101: obtain the speech signal to be processed.
In a specific implementation, the speech signal to be processed is the digital signal obtained after analog-to-digital conversion of the corresponding analog signal.
Step S102: preprocess the obtained speech signal to obtain the preprocessed speech signal.
In a specific implementation, preprocessing the obtained speech signal consists of sampling and quantization, pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
In an embodiment of the present invention, pre-emphasis of the speech signal uses the first-order digital filter H(z) = 1 - μz^(-1) with μ = 0.98; the frame length used for framing is 320 samples with a frame shift of 80 samples; windowing uses a Hamming window.
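As an illustrative sketch of this preprocessing step, the following applies the same first-order pre-emphasis filter and Hamming-windowed framing (μ = 0.98, frame length 320, frame shift 80); the 8 kHz toy tone is an assumption for demonstration only:

```python
import numpy as np

def preprocess(signal, mu=0.98, frame_len=320, frame_shift=80):
    """Pre-emphasis H(z) = 1 - mu*z^-1, then overlapping Hamming-windowed frames."""
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

signal = np.sin(2 * np.pi * 100 * np.arange(1600) / 8000)  # 0.2 s toy tone at 8 kHz
frames = preprocess(signal)  # one windowed frame per row
```

With a 1600-sample input, the 320/80 frame geometry yields 17 overlapping frames.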
In an embodiment of the present invention, in order to improve the accuracy of endpoint detection, a two-stage detection method is used to detect the speech endpoints. Specifically:
First, the frames of the preprocessed speech signal are traversed in order; the short-time energy of the current frame and of a preset number of consecutive frames after it is calculated; and each of these short-time energies is compared with the short-time energy of the initial silent segment, to determine whether the short-time energies of the current frame and of the preset number of consecutive frames after it are all greater than or equal to the short-time energy of the initial silent segment.
Then, when the short-time energies of the current frame and of the preset number of consecutive frames after it are all greater than or equal to the short-time energy of the initial silent segment, the ratio of short-time energy between the current frame and the next frame is calculated, and when the calculated ratio is greater than or equal to a preset threshold, the current frame is determined to be the speech start frame of the speech signal.
For example, let the time-domain expression of the speech signal be x(l), and let the n-th frame obtained after framing, windowing, and the other preprocessing steps be x_n(m). Its short-time energy E_n is then:
E_n = sum_{m=0}^{N-1} x_n^2(m) (1)
where N denotes the frame length of each frame.
The first-stage detection examines five consecutive frames of the speech signal and judges whether their short-time energies all satisfy:
E_i >= T(IS), i ∈ {m, m+1, m+2, m+3, m+4} (2)
where IS is the average duration of the initial silent segment of the speech, and T(IS) denotes the short-time energy of the initial silent segment, i.e. the background noise level.
This first-stage detection avoids the influence of burr interference on endpoint detection.
When the first-stage detection passes, that is, when the short-time energies of the five consecutive frames are all greater than or equal to the short-time energy of the initial silent segment, the second-stage ratio decision is entered:
E_(n+1)/E_n >= σ_n (3)
where σ_n is the second-stage detection threshold, i.e. the preset threshold on the ratio between the short-time energy of the latter frame and that of the former frame in two adjacent frames, used to judge the start of speech. The n-th frame satisfying the above formula is the start frame of the speech.
Using the two-stage detection method to detect the speech endpoints can improve the accuracy of speech start frame detection, and thereby the accuracy of speech emotion recognition.
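A minimal sketch of the two-stage start-frame search described above; the noise-energy and ratio thresholds here are illustrative assumptions, since the patent's own IS and σ_n values are not given numerically:

```python
import numpy as np

def find_speech_start(frames, noise_energy, ratio_threshold=2.0, run=5):
    """Two-stage start-frame search.

    Stage 1: a run of `run` consecutive frames must all have short-time
    energy >= the initial-silence (noise) energy.
    Stage 2: the energy ratio between the next frame and the current frame
    must reach `ratio_threshold`.
    """
    energy = np.sum(frames ** 2, axis=1)  # E_n per frame
    for n in range(len(energy) - run):
        if np.all(energy[n:n + run] >= noise_energy):                     # stage 1
            if energy[n + 1] / max(energy[n], 1e-12) >= ratio_threshold:  # stage 2
                return n
    return None

# toy frames: 5 quiet frames, then energy ramping up at frame index 5
frames = np.vstack([np.full((5, 4), 0.01),
                    np.array([[0.5] * 4, [1.5] * 4, [2.0] * 4,
                              [2.5] * 4, [3.0] * 4, [3.5] * 4])])
start = find_speech_start(frames, noise_energy=0.01)
```

On the toy input, the five quiet frames fail stage 1 and the search returns frame index 5 as the start frame.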
Step S103: extract the characteristic parameters of the preprocessed speech signal. The characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formant features and their derived parameters, the 20 orders of Mel cepstrum coefficients sought for the MFCC, and the maximum value, minimum value, mean, and variance of the first-order difference of the MFCC.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of the multiple frames obtained after preprocessing, the maximum value, minimum value, mean, and variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of that linear regression coefficient, and the proportion of the short-time energy below 250 Hz in the total short-time energy.
Because the energy of the speech signal is strongly correlated with the expression of emotion, formula (1) is first used to calculate the short-time energy of each preprocessed frame, and the maximum value, minimum value, mean, and variance of the short-time energy over those frames are obtained.
Then the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of that linear regression coefficient, and the proportion of the short-time energy below 250 Hz in the total short-time energy are calculated. Specifically:
In an embodiment of the present invention, the short-time energy jitter of the preprocessed frames is calculated with formula (4), where E_s denotes the short-time energy jitter and M denotes the total number of frames.
In an embodiment of the present invention, the linear regression coefficient of the short-time energy of the preprocessed frames is calculated with formula (5), where E_r denotes the linear regression coefficient of the short-time energy and M denotes the total number of frames.
In an embodiment of the present invention, the mean square error of the linear regression coefficient of the short-time energy of the preprocessed frames is calculated with formula (6), where E_q denotes the mean square error of the linear regression coefficient of the short-time energy.
In an embodiment of the present invention, the proportion of the short-time energy below 250 Hz in the total short-time energy of the preprocessed frames is given by formula (7), where E_250 denotes the sum of the short-time energy below 250 Hz in the frequency domain.
In an embodiment of the present invention, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of the multiple frames obtained after preprocessing, the maximum value, minimum value, mean, and variance of the fundamental frequency, the first-order and second-order pitch jitter, and the pitch difference between the voiced parts of adjacent frames satisfying F(i)*F(i+1) != 0. Specifically:
The frequency at which the vocal cords vibrate is called the fundamental frequency. In an embodiment of the present invention, the short-time autocorrelation function is used to obtain the fundamental frequency. Specifically, the autocorrelation coefficient R_n(k) of each preprocessed frame is first computed; the position of the peak of R_n(k) is detected to extract the corresponding pitch period; and taking the reciprocal of the pitch period then gives the corresponding fundamental frequency.
In an embodiment of the present invention, for the preprocessed n-th frame x_n(m), the autocorrelation function R_n(k) is:
R_n(k) = sum_{m=0}^{N-1-k} x_n(m) x_n(m+k) (8)
where R_n(k) is an even function whose nonzero values lie in the range k = -(N-1) to (N-1).
When the fundamental frequency of each preprocessed frame has been obtained, the maximum value, minimum value, mean, and variance of the fundamental frequency over those frames can be computed.
In an embodiment of the present invention, let the fundamental frequency of the i-th voiced frame be F0_i, the total number of voiced frames be M*, and the total number of frames be M; the corresponding first-order and second-order pitch jitter are then given by formulas (9) and (10), where F0_s1 denotes the first-order pitch jitter and F0_s2 denotes the second-order pitch jitter.
In an embodiment of the present invention, among all adjacent frame pairs satisfying F(i)*F(i+1) != 0, the pitch difference dF between the corresponding voiced parts is:
dF(k) = F(i) - F(i+1), 1 ≤ k ≤ M*, 1 ≤ i ≤ M (11)
where F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
In an embodiment of the present invention, the voice-quality formant features of the preprocessed speech signal and their derived parameters include, for each voiced frame among the multiple frames obtained after preprocessing, the maximum value, minimum value, mean, variance, and first-order jitter of the first, second, and third formant frequencies, and the maximum value, minimum value, and mean of the second formant frequency ratio.
In a specific implementation, the formant parameters include the formant bandwidth and frequency. Their extraction is based on estimating the spectral envelope of the speech signal; here the formant parameters are estimated from the vocal tract model using the linear prediction (LPC) method.
In an embodiment of the present invention, the speech signal is deconvolved with the LPC method to obtain the all-pole model parameters of the vocal tract response, as in formula (12).
Then the roots of the prediction error filter A(z) are found, and the formant frequency F_i corresponding to root z_i is:
F_i = θ_i/(2πT) (13)
where θ_i is the phase angle of the root z_i and T is the sampling period.
Let the first, second, and third formant frequencies of the i-th voiced frame be F1_i, F2_i, and F3_i; the second formant frequency ratio is then:
F2_i/(F2_i-F1_i) (14)
When the first, second, and third formant frequencies of each frame have been calculated with the above formulas, the maximum value, minimum value, mean, variance, and first-order jitter of the first, second, and third formant frequencies, together with the maximum value, minimum value, and mean of the second formant frequency ratio, can be obtained.
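A sketch of LPC-based formant estimation along the lines of formulas (12)-(13): solve the autocorrelation normal equations for the all-pole model, take the roots of A(z), and convert root angles to frequencies with F_i = θ_i/(2πT). The low model order and the toy two-resonance frame are assumptions for demonstration, not the patent's configuration:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """LPC via the autocorrelation method (solving the Yule-Walker normal
    equations directly; production code would use Levinson-Durbin)."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^-k

def formants(frame, fs, order=10):
    """Roots of A(z) above the real axis give candidate formant frequencies
    F_i = angle(z_i) * fs / (2*pi), returned in ascending order."""
    A = lpc_coefficients(frame, order)
    roots = [z for z in np.roots(A) if z.imag > 0.01]
    return sorted(np.angle(z) * fs / (2 * np.pi) for z in roots)

fs = 8000
t = np.arange(320) / fs
# toy frame with two strong resonances near 700 Hz and 1800 Hz
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1800 * t)
F = formants(frame, fs, order=4)
```

With order 4 the two conjugate pole pairs land at the two resonances, so the recovered frequencies are close to 700 Hz and 1800 Hz.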
In an embodiment of the present invention, in order to improve recognition accuracy, the nonlinear Hz-Mel relation is corrected during the extraction of the Mel frequency cepstral coefficients by introducing two new coefficient sets, Mid-MFCC and I-MFCC. Mid-MFCC and I-MFCC offer good computational precision in the middle and high frequency regions respectively, and serve as a supplement to the low-order MFCC so that spectral features over the full frequency domain can be calculated. The filters of the Mid-MFCC filter bank are distributed more densely in the middle frequency region and more sparsely in the low and high frequency regions; I-MFCC is the inverse Mel frequency cepstral coefficient, whose filter bank is sparse in the low frequency region and denser in the high frequency region.
In an embodiment of the present invention, the 20 orders of Mel cepstrum coefficients sought for the MFCC include the MFCC of orders 1-6, the Mid-MFCC of orders 3-10, and the I-MFCC of orders 7-12, each calculated with its own frequency-warping formula (15), where f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
Finally, the characteristic parameters of the improved MFCC consist of these 20 orders of Mel cepstrum parameters together with the maximum value, minimum value, mean, and variance of the first-order difference of the Mel cepstrum coefficients (MFCC).
Step S104: combine the extracted characteristic parameters into the corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal.
In a specific implementation, once the characteristic parameters of the preprocessed frames have been extracted, they are combined in order into the corresponding feature vector sequence, giving the feature vector sequence corresponding to the speech signal.
Step S105: train and classify the feature vector sequence corresponding to the speech signal with a support vector machine (SVM), obtaining the corresponding speech emotion recognition result.
In a specific implementation, once the feature vector sequence corresponding to the speech signal has been obtained, it can be trained and classified with a support vector machine (SVM) to obtain the corresponding speech emotion recognition result.
In an embodiment of the present invention, the kernel function of the support vector machine is chosen to be the radial basis function (RBF), and the support vector machine classifier used is a five-class classifier in the "one-vs-one" pattern.
Specifically, in the process of training the support vector machines, five emotions are recognized. Following the "one-vs-one" strategy, 10 support vector machine classifiers can be built, namely the "anger-fear", "anger-sadness", "anger-neutral", "anger-happiness", "fear-sadness", "fear-neutral", "fear-happiness", "sadness-neutral", "sadness-happiness", and "neutral-happiness" classifiers.
Then, with 150 training set samples and 50 test set samples per emotion, the feature vector sequences composed of the characteristic parameters extracted in the above steps are input to the 10 trained support vector machine classifiers.
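The pair enumeration behind the 10 classifiers listed above can be sketched directly; the English emotion names are renderings of the five classes in the text:

```python
from itertools import combinations

emotions = ["anger", "fear", "sadness", "neutral", "happiness"]

# one-vs-one: one binary classifier per unordered pair of emotions,
# n*(n-1)/2 = 10 binary SVMs for n = 5 classes
pairs = list(combinations(emotions, 2))

# at prediction time each binary classifier casts a vote and the
# emotion collecting the most votes is the recognition result
```

This is why a five-emotion one-vs-one SVM needs exactly the 10 pairwise classifiers enumerated in the description.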
Experimental comparisons of the recognition accuracy obtained with the speech emotion recognition method in the embodiment of the present invention and with a prior-art speech emotion recognition method are shown in Tables 1 and 2 below:
Table 1
Table 2
From the comparison of the above tables, it can be seen that the recognition accuracy of the speech emotion recognition method in the embodiment of the present invention is significantly improved.
The speech emotion recognition method in the embodiment of the present invention has been described in detail above; the device corresponding to the above method is introduced below.
Fig. 2 shows the structure of a speech emotion recognition device in an embodiment of the present invention. Referring to Fig. 2, the device 20 may include an acquiring unit 201, a pre-processing unit 202, a parameter extraction unit 203, and a recognition unit 204, wherein:
The acquiring unit 201 is adapted to acquire a speech signal to be processed.
The pre-processing unit 202 is adapted to pre-process the acquired speech signal to obtain a pre-processed speech signal.
The parameter extraction unit 203 is adapted to extract the characteristic parameters of the pre-processed speech signal, and to compose the extracted characteristic parameters into a corresponding feature vector sequence so as to obtain the feature vector sequence corresponding to the speech signal. The characteristic parameters include the short-time energy and its derived parameters, the pitch frequency and its derived parameters, the voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients obtained for the MFCC, and the maximum value, minimum value, mean, and variance of the first-order difference of the MFCC.
The recognition unit 204 is adapted to train and recognize the feature vector sequence corresponding to the speech signal using a support vector machine to obtain a corresponding speech emotion recognition result.
In a specific implementation, the pre-processing unit 202 is adapted to perform sampling and quantization, pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection on the acquired speech signal.
In a specific implementation, the pre-processing unit 202 is adapted to: traverse the frames obtained after pre-processing to obtain a current frame; calculate the short-time energies of the current frame and of a preset number of consecutive frames after it; when it is determined that the short-time energies of the current frame and of the preset number of consecutive frames after it are all greater than or equal to the short-time energy of the initial silent-segment speech signal, calculate the ratio of the short-time energy between the current frame and the next frame; and when it is determined that the calculated ratio is greater than or equal to a preset threshold, determine that the current frame is the speech start frame of the speech signal.
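A minimal sketch of this start-frame rule follows (illustrative only; the number of look-ahead frames, the threshold value, and the direction of the energy ratio are assumptions, since the text does not fix them):

```python
import numpy as np

def short_time_energy(frame):
    # Short-time energy of one frame: sum of squared samples.
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def find_speech_start(frames, silence_energy, n_ahead=2, ratio_threshold=2.0):
    """Return the index of the speech start frame, or None.
    n_ahead and ratio_threshold are illustrative values, and the ratio is
    taken as next/current (an assumption about the rule's direction)."""
    energies = [short_time_energy(f) for f in frames]
    for i in range(len(frames) - n_ahead - 1):
        window = energies[i:i + n_ahead + 1]          # current + n_ahead frames
        if all(e >= silence_energy for e in window):
            ratio = energies[i + 1] / max(energies[i], 1e-12)
            if ratio >= ratio_threshold:
                return i                              # speech start frame index
    return None

# Low-energy silent segment followed by sharply rising speech energy:
frames = [np.full(4, a) for a in (0.1, 0.1, 0.5, 1.5, 2.0, 2.0)]
start = find_speech_start(frames, silence_energy=0.5)
```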
In an embodiment of the present invention, the short-time energy of the pre-processed speech signal and its derived parameters include the short-time energy of each frame obtained after pre-processing, the maximum value of the short-time energy, the minimum value of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the ratio of the short-time energy below 250 Hz to the total short-time energy.
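The derived short-time-energy statistics listed above might be computed as follows (a sketch only; the exact jitter definition and the way the sub-250 Hz band energy is obtained, e.g. by band-pass filtering, are assumptions):

```python
import numpy as np

def energy_derived_params(energies, low_band_energies):
    """Derived statistics of the per-frame short-time energy.
    `low_band_energies` stands for the <=250 Hz band energy per frame."""
    e = np.asarray(energies, dtype=float)
    t = np.arange(len(e))
    slope, intercept = np.polyfit(t, e, 1)        # linear regression over frame index
    fit = slope * t + intercept
    return {
        "max": float(e.max()), "min": float(e.min()),
        "mean": float(e.mean()), "var": float(e.var()),
        "jitter": float(np.mean(np.abs(np.diff(e)))),  # one common jitter definition
        "regression_slope": float(slope),
        "regression_mse": float(np.mean((e - fit) ** 2)),
        "low_band_ratio": float(np.sum(low_band_energies) / np.sum(e)),
    }

params = energy_derived_params([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
```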
In an embodiment of the present invention, the pitch frequency of the pre-processed speech signal and its derived parameters include the pitch frequency of each frame obtained after pre-processing, the maximum value of the pitch frequency, the minimum value of the pitch frequency, the mean of the pitch frequency, the variance of the pitch frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between adjacent voiced frames satisfying F(i)*F(i+1) != 0, where F(i) denotes the pitch frequency of the i-th frame and F(i+1) denotes the pitch frequency of the (i+1)-th frame.
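A sketch of the pitch-derived parameters, including the F(i)*F(i+1) != 0 condition for the differential pitch (the jitter formulas here are common literature definitions, assumed because the patent does not reproduce its own):

```python
import numpy as np

def pitch_derived_params(f0):
    """f0[i] is the pitch frequency of frame i, with 0 marking an unvoiced frame."""
    f = np.asarray(f0, dtype=float)
    voiced = f[f > 0]
    # First- and second-order jitter: mean absolute first/second difference
    # over the voiced frames (one common definition, used here as an assumption).
    first_order_jitter = float(np.mean(np.abs(np.diff(voiced))))
    second_order_jitter = float(np.mean(np.abs(np.diff(voiced, n=2))))
    # Differential pitch only between adjacent frames that are both voiced,
    # i.e. the frame pairs with F(i) * F(i+1) != 0:
    diffs = [float(f[i + 1] - f[i]) for i in range(len(f) - 1)
             if f[i] * f[i + 1] != 0]
    return first_order_jitter, second_order_jitter, diffs

# Frame 2 is unvoiced, so the pair (110, 0) and (0, 120) contribute no diff:
j1, j2, d = pitch_derived_params([100.0, 110.0, 0.0, 120.0, 125.0])
```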
In an embodiment of the present invention, the voice-quality formant features of the pre-processed speech signal and their derived parameters include, for each voiced frame among the frames obtained after pre-processing, the maximum values of the first, second, and third formant frequencies, the minimum values of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances of the first, second, and third formant frequencies, the first-order jitter of the first, second, and third formant frequencies, the maximum value of the second formant frequency ratio, the minimum value of the second formant frequency ratio, and the mean of the second formant frequency ratio.
In an embodiment of the present invention, the 20 Mel cepstrum coefficients obtained for the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
In an embodiment of the present invention, the parameter extraction unit 203 is adapted to calculate the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 using the following formulas respectively:
wherein fMel denotes the frequency of the MFCC, fMid-Mel denotes the frequency of the Mid-MFCC, fI-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
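The warping formulas themselves appear only as images in the patent and are not reproduced in this text. As a hedged sketch, the standard Mel scale (which fMel conventionally denotes) and one inverted-Mel construction from the I-MFCC literature are shown below; the 4000 Hz band edge is an assumption, and the Mid-MFCC warping is omitted because its constants cannot be recovered from this text:

```python
import math

def f_mel(f):
    # Standard Mel scale; the well-established formula commonly denoted f_Mel.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def f_i_mel(f, f_max=4000.0):
    # Inverted Mel scale (from the I-MFCC literature): mirror the Mel warping
    # about the band edge so that frequency resolution is fine at HIGH
    # frequencies instead of low ones. f_max = 4000 Hz is an assumption.
    return f_mel(f_max) - f_mel(f_max - f)
```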
An embodiment of the present invention further provides a computer-readable storage medium on which computer instructions are stored, wherein the steps of the speech emotion recognition method are executed when the computer instructions are run. For the speech emotion recognition method, refer to the introduction in the preceding sections; it is not repeated here.
An embodiment of the present invention further provides a terminal, including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor executes the steps of the speech emotion recognition method when running the computer instructions. For the speech emotion recognition method, refer to the introduction in the preceding sections; it is not repeated here.
By using the above method in the embodiment of the present invention, the accuracy of speech emotion recognition can be significantly improved.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium; the storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.
Although the present disclosure is as described above, the present invention is not limited thereto. Any person skilled in the art can make various changes or modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be subject to the scope defined by the claims.

Claims (10)

1. A speech emotion recognition method, characterized by comprising:
acquiring a speech signal to be processed;
pre-processing the acquired speech signal to obtain a pre-processed speech signal;
extracting characteristic parameters of the pre-processed speech signal, the characteristic parameters including the short-time energy and its derived parameters, the pitch frequency and its derived parameters, the voice-quality formant features and their derived parameters, the 20 Mel cepstrum coefficients obtained for the MFCC, and the maximum value, minimum value, mean, and variance of the first-order difference of the MFCC;
composing the extracted characteristic parameters into a corresponding feature vector sequence to obtain the feature vector sequence corresponding to the speech signal;
training and recognizing the feature vector sequence corresponding to the speech signal using a support vector machine to obtain a corresponding speech emotion recognition result.
2. The speech emotion recognition method according to claim 1, characterized in that pre-processing the acquired speech signal includes:
performing sampling and quantization, pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection on the acquired speech signal.
3. The speech emotion recognition method according to claim 2, characterized in that, when performing endpoint detection on the speech signal, the speech start frame is determined in the following manner:
traversing the frames obtained after pre-processing to obtain a current frame;
calculating the short-time energies of the current frame and of a preset number of consecutive frames after it;
when it is determined that the short-time energies of the current frame and of the preset number of consecutive frames after it are all greater than or equal to the short-time energy of the initial silent-segment speech signal, calculating the ratio of the short-time energy between the current frame and the next frame;
when it is determined that the calculated ratio is greater than or equal to a preset threshold, determining that the current frame is the speech start frame of the speech signal.
4. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the short-time energy of the pre-processed speech signal and its derived parameters include the short-time energy of each frame obtained after pre-processing, the maximum value of the short-time energy, the minimum value of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the ratio of the short-time energy below 250 Hz to the total short-time energy.
5. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the pitch frequency of the pre-processed speech signal and its derived parameters include the pitch frequency of each frame obtained after pre-processing, the maximum value of the pitch frequency, the minimum value of the pitch frequency, the mean of the pitch frequency, the variance of the pitch frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between adjacent voiced frames satisfying F(i)*F(i+1) != 0, wherein F(i) denotes the pitch frequency of the i-th frame and F(i+1) denotes the pitch frequency of the (i+1)-th frame.
6. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the voice-quality formant features of the pre-processed speech signal and their derived parameters include, for each voiced frame among the frames obtained after pre-processing, the maximum values of the first, second, and third formant frequencies, the minimum values of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances of the first, second, and third formant frequencies, the first-order jitter of the first, second, and third formant frequencies, the maximum value of the second formant frequency ratio, the minimum value of the second formant frequency ratio, and the mean of the second formant frequency ratio.
7. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the 20 Mel cepstrum coefficients obtained for the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
8. The speech emotion recognition method according to claim 7, characterized in that the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 are calculated using the following formulas respectively:
wherein fMel denotes the frequency of the MFCC, fMid-Mel denotes the frequency of the Mid-MFCC, fI-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
9. A computer-readable storage medium on which computer instructions are stored, characterized in that the steps of the speech emotion recognition method according to any one of claims 1 to 8 are executed when the computer instructions are run.
10. A terminal, characterized by comprising a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor executes the steps of the speech emotion recognition method according to any one of claims 1 to 8 when running the computer instructions.
CN201810455163.7A 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal Active CN108305639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810455163.7A CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810455163.7A CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Publications (2)

Publication Number Publication Date
CN108305639A true CN108305639A (en) 2018-07-20
CN108305639B CN108305639B (en) 2021-03-09

Family

ID=62846586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810455163.7A Active CN108305639B (en) 2018-05-11 2018-05-11 Speech emotion recognition method, computer-readable storage medium and terminal

Country Status (1)

Country Link
CN (1) CN108305639B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243492A * 2018-10-28 2019-01-18 National Computer Network and Information Security Administration Center Speech emotion recognition system and recognition method
CN109300339A * 2018-11-19 2019-02-01 Wang Hongyi Spoken English training method and system
CN109946055A * 2019-03-22 2019-06-28 Wuhan Yuanhai Bochuang Technology Co., Ltd. Abnormal sound detection method and system for automobile seat slide rail
CN110491417A * 2019-08-09 2019-11-22 Beijing Moviebook Technology Co., Ltd. Speech emotion recognition method and device based on deep learning
CN111243627A * 2020-01-13 2020-06-05 Unisound Intelligent Technology Co., Ltd. Voice emotion recognition method and device
CN111768800A * 2020-06-23 2020-10-13 ZTE Corporation Voice signal processing method, apparatus and storage medium
CN113807249A * 2021-09-17 2021-12-17 Guangzhou University Emotion recognition method, system, device and medium based on multi-modal feature fusion

Citations (5)

Publication number Priority date Publication date Assignee Title
CN103578470A * 2012-08-09 2014-02-12 Anhui USTC iFLYTEK Information Technology Co., Ltd. Telephone recording data processing method and system
US9495954B2 * 2010-08-06 2016-11-15 At&T Intellectual Property I, L.P. System and method of synthetic voice generation and modification
CN106448659A * 2016-12-19 2017-02-22 Guangdong University of Technology Speech endpoint detection method based on short-time energy and fractal dimensions
CN106887241A * 2016-10-12 2017-06-23 Alibaba Group Holding Ltd. Voice signal detection method and device
CN108010515A * 2017-11-21 2018-05-08 Tsinghua University Speech endpoint detection and wake-up method and device

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US9495954B2 * 2010-08-06 2016-11-15 At&T Intellectual Property I, L.P. System and method of synthetic voice generation and modification
CN103578470A * 2012-08-09 2014-02-12 Anhui USTC iFLYTEK Information Technology Co., Ltd. Telephone recording data processing method and system
CN106887241A * 2016-10-12 2017-06-23 Alibaba Group Holding Ltd. Voice signal detection method and device
CN106448659A * 2016-12-19 2017-02-22 Guangdong University of Technology Speech endpoint detection method based on short-time energy and fractal dimensions
CN108010515A * 2017-11-21 2018-05-08 Tsinghua University Speech endpoint detection and wake-up method and device

Non-Patent Citations (2)

Title
ZHAO Li et al., "Several Key Techniques in Practical Speech Emotion Recognition", Journal of Data Acquisition and Processing *
HAN Yi et al., "Speech Emotion Recognition Based on MFCC", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN109243492A * 2018-10-28 2019-01-18 National Computer Network and Information Security Administration Center Speech emotion recognition system and recognition method
CN109300339A * 2018-11-19 2019-02-01 Wang Hongyi Spoken English training method and system
CN109946055A * 2019-03-22 2019-06-28 Wuhan Yuanhai Bochuang Technology Co., Ltd. Abnormal sound detection method and system for automobile seat slide rail
CN109946055B * 2019-03-22 2021-01-12 Ningbo Huisheng Zhichuang Technology Co., Ltd. Method and system for detecting abnormal sound of automobile seat slide rail
CN110491417A * 2019-08-09 2019-11-22 Beijing Moviebook Technology Co., Ltd. Speech emotion recognition method and device based on deep learning
CN111243627A * 2020-01-13 2020-06-05 Unisound Intelligent Technology Co., Ltd. Voice emotion recognition method and device
CN111243627B * 2020-01-13 2022-09-27 Unisound Intelligent Technology Co., Ltd. Voice emotion recognition method and device
CN111768800A * 2020-06-23 2020-10-13 ZTE Corporation Voice signal processing method, apparatus and storage medium
CN113807249A * 2021-09-17 2021-12-17 Guangzhou University Emotion recognition method, system, device and medium based on multi-modal feature fusion
CN113807249B * 2021-09-17 2024-01-12 Guangzhou University Emotion recognition method, system, device and medium based on multi-modal feature fusion

Also Published As

Publication number Publication date
CN108305639B (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN108305639A (en) Speech-emotion recognition method, computer readable storage medium, terminal
CN108682432A (en) Speech emotion recognition device
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
Kulmer et al. Phase estimation in single channel speech enhancement using phase decomposition
CN106486131A (en) A kind of method and device of speech de-noising
CN104700843A (en) Method and device for identifying ages
Hui et al. Convolutional maxout neural networks for speech separation
CN109448726A (en) A kind of method of adjustment and system of voice control accuracy rate
Mittal et al. Study of characteristics of aperiodicity in Noh voices
Yuan A time–frequency smoothing neural network for speech enhancement
CN110136709A (en) Audio recognition method and video conferencing system based on speech recognition
Jiao et al. Convex weighting criteria for speaking rate estimation
CN108288465A (en) Intelligent sound cuts the method for axis, information data processing terminal, computer program
Archana et al. Gender identification and performance analysis of speech signals
Morales-Cordovilla et al. Feature extraction based on pitch-synchronous averaging for robust speech recognition
CN108269574A (en) Voice signal processing method and device, storage medium and electronic equipment
Hanilçi et al. Comparing spectrum estimators in speaker verification under additive noise degradation
Sinha et al. On the use of pitch normalization for improving children's speech recognition
Huang et al. DNN-based speech enhancement using MBE model
CN106356076A (en) Method and device for detecting voice activity on basis of artificial intelligence
CN111755029B (en) Voice processing method, device, storage medium and electronic equipment
Lu Reduction of musical residual noise using block-and-directional-median filter adapted by harmonic properties
Jameel et al. Noise robust formant frequency estimation method based on spectral model of repeated autocorrelation of speech
Thirumuru et al. Application of non-negative frequency-weighted energy operator for vowel region detection
Shome et al. Non-negative frequency-weighted energy-based speech quality estimation for different modes and quality of speech

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant