CN108305639A - Speech-emotion recognition method, computer readable storage medium, terminal - Google Patents
Speech-emotion recognition method, computer readable storage medium, terminal
- Publication number
- CN108305639A (application CN201810455163.7A)
- Authority
- CN
- China
- Prior art keywords
- short
- mfcc
- voice signal
- time energy
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Auxiliary Devices For Music (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
A speech emotion recognition method and device, a computer-readable storage medium, and a terminal. The method includes: obtaining a speech signal to be processed; preprocessing the obtained speech signal to obtain a preprocessed speech signal; extracting characteristic parameters of the preprocessed speech signal, the characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC; composing the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal; and training and classifying the feature vector sequence corresponding to the speech signal with a support vector machine, obtaining the corresponding speech emotion recognition result. The scheme can improve the accuracy of speech emotion recognition.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a speech emotion recognition method, a computer-readable storage medium, and a terminal.
Background
With the rapid development of information technology, humanity's dependence on computers keeps growing, and the ability of human-computer interaction receives ever more attention from researchers. The problem to be solved is to make human-computer interaction as consistent as exchanges between people, and the most critical requirement is the ability of "emotional intelligence".
At present, as research on affective information processing deepens, research on affective information processing in speech signals is increasingly valued. Speech emotion recognition refers to the technology of processing and recognizing speech signals by means of signal processing and pattern recognition methods, so as to judge which kind of emotion the speech belongs to.
However, existing speech emotion recognition methods suffer from low recognition accuracy.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of speech emotion recognition.
To solve the above technical problem, an embodiment of the present invention provides a speech emotion recognition method, the method including:
Obtaining a speech signal to be processed;
Preprocessing the obtained speech signal to obtain a preprocessed speech signal;
Extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC;
Composing the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal;
Training and classifying the feature vector sequence corresponding to the speech signal using a support vector machine, obtaining the corresponding speech emotion recognition result.
Optionally, preprocessing the obtained speech signal includes:
Sampling and quantizing the obtained speech signal, followed by pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
Optionally, when performing endpoint detection on the speech signal, the speech start frame is determined as follows:
Traversing the frames obtained after preprocessing to obtain the current frame;
Calculating the short-time energy of the current frame and of a preset number of consecutive frames following it;
When the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, calculating the ratio of the short-time energy between the current frame and the next frame;
When the calculated ratio is greater than or equal to a preset threshold, determining the current frame to be the speech start frame of the speech signal.
Optionally, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
Optionally, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each frame obtained after preprocessing, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0; wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
Optionally, the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
Optionally, the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
Optionally, the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 are respectively calculated using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
An embodiment of the present invention further provides a computer-readable storage medium on which computer instructions are stored, wherein the computer instructions, when run, execute the steps of any one of the above speech emotion recognition methods.
An embodiment of the present invention further provides a terminal including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, executes the steps of any one of the above speech emotion recognition methods.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
In the above scheme, characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, and 20 orders of Mel cepstral coefficients derived from the MFCC are extracted from the preprocessed speech signal. The extracted 20 orders of Mel cepstral coefficients cover the full frequency domain; compared with MFCC parameters covering only the low frequencies, they improve the recognition precision in the middle and high frequency bands, and can therefore improve the accuracy of speech emotion recognition and the user experience.
Further, when performing endpoint detection on the speech signal, the short-time energy of the current frame and of a preset number of consecutive frames following it is calculated first; when these short-time energies are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, the ratio of the short-time energy between the current frame and the next frame is calculated; and when the calculated ratio is greater than or equal to a preset threshold, the current frame is determined to be the speech start frame of the speech signal. Because the judgment of whether the short-time energy reaches that of the initial silent segment is first made over the preset number of consecutive frames including the current frame, the influence of spike ("burr") interference on endpoint detection can be reduced or even avoided, and the accuracy of endpoint detection can therefore be improved.
Description of the drawings
Fig. 1 is a flow diagram of a speech emotion recognition method in an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of a speech emotion recognition device in an embodiment of the present invention.
Detailed description
In the technical solution of the embodiments of the present invention, characteristic parameters including short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, and 20 orders of Mel cepstral coefficients derived from the MFCC are extracted from the preprocessed speech signal. The extracted 20 orders of Mel cepstral coefficients cover the full frequency domain; compared with MFCC parameters covering only the low frequencies, they improve the recognition precision in the middle and high frequency bands, thereby improving the accuracy of speech emotion recognition and the user experience.
To make the above objects, features, and beneficial effects of the present invention more comprehensible, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a speech emotion recognition method according to an embodiment of the present invention. Referring to Fig. 1, the speech emotion recognition method includes:
Step S101: Obtain the speech signal to be processed.
In a specific implementation, the speech signal to be processed is a digital signal obtained by performing analog-to-digital conversion on the corresponding analog signal.
Step S102: Preprocess the obtained speech signal to obtain a preprocessed speech signal.
In a specific implementation, preprocessing the obtained speech signal comprises first sampling and quantizing it, then applying pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
In an embodiment of the present invention, pre-emphasis is performed with the first-order digital filter H(z) = 1 − μz⁻¹, with μ = 0.98; the frame length used for framing is 320 samples with a frame shift of 80; and a Hamming window is used for windowing.
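A minimal sketch of this preprocessing step, assuming NumPy and the parameter values stated above (μ = 0.98, frame length 320, frame shift 80, Hamming window); the function name and return layout are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def preprocess(signal, mu=0.98, frame_len=320, frame_shift=80):
    """Pre-emphasis, framing, and windowing of a sampled speech signal.

    Assumes `signal` is a 1-D float array at least `frame_len` samples long.
    """
    # First-order pre-emphasis filter H(z) = 1 - mu * z^-1
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Split into overlapping frames and apply a Hamming window to each
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```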
In an embodiment of the present invention, in order to improve the accuracy of endpoint detection, a two-stage detection method is used to detect the speech endpoints. Specifically:
First, the frames of the speech signal obtained after preprocessing may be traversed in order; the short-time energies of the current frame and of a preset number of consecutive frames following it are calculated, and each of these short-time energies is compared with the short-time energy of the initial silent segment of the speech signal, so as to determine whether they are all greater than or equal to the short-time energy of the initial silent segment.
Then, when the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment, the ratio of the short-time energy between the current frame and the next frame can be calculated; and when the calculated ratio is greater than or equal to a preset threshold, the current frame is determined to be the speech start frame of the speech signal.
For example, taking the time-domain expression of the speech signal as x(l), and the n-th frame of speech obtained after preprocessing steps such as framing and windowing as x_n(m), its short-time energy E_n is:
E_n = Σ_{m=0}^{N−1} x_n²(m)   (1)
wherein N denotes the frame length of each frame.
In the first detection stage, the speech signal of five consecutive frames is examined to judge whether the short-time energies of those five frames all satisfy:
E_i ≥ T_i(IS), i ∈ {m, m+1, m+2, m+3, m+4}   (2)
wherein IS is the average duration of the initial silent segment of the speech, and T_i(IS) denotes the short-time energy of the speech signal of the initial silent segment, i.e. the background noise.
Through this first detection stage, the influence of spike interference on endpoint detection can be avoided.
When the first stage is passed, that is, when the short-time energies of the five consecutive frames are determined to be all greater than or equal to the short-time energy of the initial silent segment, the second-stage ratio decision is entered:
E_{n+1} / E_n ≥ σ_n   (3)
wherein σ_n is the second-stage detection threshold, i.e. the preset threshold against which the ratio between the short-time energy of the latter frame and that of the former frame in two adjacent frames is judged, in order to determine the beginning of speech. The n-th frame of speech data satisfying the above formula is the start frame of the speech.
Using the two-stage detection method to detect the speech endpoint can improve the accuracy of speech start frame detection, and thereby the accuracy of speech emotion recognition.
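The two-stage start-frame decision described above can be sketched as follows; `noise_energy` stands for T_i(IS) (the short-time energy of the initial silent segment) and `ratio_threshold` for the second-stage threshold σ_n, both of whose concrete values the patent leaves as presets:

```python
import numpy as np

def short_time_energy(frames):
    # E_n = sum of x_n(m)^2 over each frame, per formula (1)
    return np.sum(frames ** 2, axis=1)

def find_speech_start(frames, noise_energy, ratio_threshold, n_check=5):
    """Two-stage endpoint detection sketch; returns the index of the
    speech start frame, or None if no start frame is found."""
    energy = short_time_energy(frames)
    for m in range(len(energy) - n_check + 1):
        # Stage 1: the current frame and the following frames (five in
        # total here, per formula (2)) must all reach the background-noise
        # energy level, which suppresses isolated spikes.
        if np.all(energy[m : m + n_check] >= noise_energy):
            # Stage 2: the energy ratio between the next frame and the
            # current frame must reach the preset threshold, formula (3).
            if energy[m + 1] / energy[m] >= ratio_threshold:
                return m
    return None
```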
Step S103: Extract the characteristic parameters of the preprocessed speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
Because the energy of a speech signal is strongly correlated with the expression of emotion, formula (1) is first used to calculate the short-time energy of each preprocessed frame, and the maximum, minimum, mean, and variance of the short-time energy over the preprocessed frames are obtained.
Then, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient, and the proportion of the total short-time energy below 250 Hz are calculated over the preprocessed frames. Wherein:
In an embodiment of the present invention, the short-time energy jitter over the preprocessed frames is calculated with the following formula:
E_s = [ (1/(M−1)) Σ_{i=1}^{M−1} |E_i − E_{i+1}| ] / [ (1/M) Σ_{i=1}^{M} E_i ]   (4)
wherein E_s denotes the short-time energy jitter and M denotes the total number of frames.
In an embodiment of the present invention, the linear regression coefficient of the short-time energy over the preprocessed frames is calculated with the following formula:
E_r = [ Σ_{i=1}^{M} i·E_i − (1/M) Σ_{i=1}^{M} i · Σ_{i=1}^{M} E_i ] / [ Σ_{i=1}^{M} i² − (1/M) (Σ_{i=1}^{M} i)² ]   (5)
wherein E_r denotes the linear regression coefficient of the short-time energy (the least-squares slope of the energy against the frame index) and M denotes the total number of frames.
In an embodiment of the present invention, the mean square error of the linear regression of the short-time energy over the preprocessed frames is calculated with the following formula:
E_q = (1/M) Σ_{i=1}^{M} (E_i − Ê_i)²   (6)
wherein E_q denotes the mean square error of the linear regression of the short-time energy, and Ê_i is the value of E_i predicted by the regression line.
In an embodiment of the present invention, the proportion of the total short-time energy of the preprocessed frames lying below 250 Hz is:
E_250 / Σ_{i=1}^{M} E_i   (7)
wherein E_250 denotes the sum of the short-time energy below 250 Hz in the frequency domain.
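A sketch of the energy statistics listed above, assuming NumPy; how the sub-250 Hz energy is obtained (for example from an FFT band sum per frame) is an assumption, since the text only names the ratio:

```python
import numpy as np

def energy_features(energy, energy_below_250hz):
    """`energy` holds the per-frame short-time energies E_1..E_M (NumPy
    array); `energy_below_250hz` the per-frame energy restricted to
    frequencies below 250 Hz."""
    M = len(energy)
    i = np.arange(1, M + 1)

    # Least-squares line E_i ~ a + E_r * i over the frame index
    E_r, a = np.polyfit(i, energy, 1)             # slope E_r per formula (5)
    E_q = np.mean((energy - (a + E_r * i)) ** 2)  # regression MSE, formula (6)

    return {
        "max": energy.max(), "min": energy.min(),
        "mean": energy.mean(), "var": energy.var(),
        # Jitter per formula (4): normalised frame-to-frame variation
        "jitter": np.mean(np.abs(np.diff(energy))) / energy.mean(),
        "regression_coef": E_r,
        "regression_mse": E_q,
        # Proportion of the total energy below 250 Hz, formula (7)
        "ratio_below_250hz": energy_below_250hz.sum() / energy.sum(),
    }
```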
In an embodiment of the present invention, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each preprocessed frame, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0. Wherein:
The frequency at which the vocal cords vibrate is called the fundamental frequency. In an embodiment of the present invention, the short-time autocorrelation function is used to obtain the fundamental frequency. Specifically, the autocorrelation coefficient R_n(k) of each preprocessed frame is first defined; the corresponding pitch period value is then extracted by detecting the position of the peak of R_n(k); and taking the reciprocal of the obtained pitch period yields the corresponding fundamental frequency.
In an embodiment of the present invention, for the preprocessed n-th frame of speech x_n(m), the autocorrelation function R_n(k) is:
R_n(k) = Σ_{m=0}^{N−1−k} x_n(m) x_n(m+k)   (8)
wherein R_n(k) is an even function whose nonzero values lie in the range k = −(N−1) to (N−1).
When the fundamental frequencies of the preprocessed frames have been obtained, the maximum, minimum, mean, and variance of the fundamental frequency over the preprocessed frames can be computed.
In an embodiment of the present invention, denoting the fundamental frequency of the i-th voiced frame as F0_i, the total number of voiced frames as M*, and the total number of preprocessed frames as M, the first-order and second-order pitch jitters are respectively:
F0_s1 = [ (1/(M*−1)) Σ_{i=1}^{M*−1} |F0_i − F0_{i+1}| ] / [ (1/M*) Σ_{i=1}^{M*} F0_i ]   (9)
F0_s2 = [ (1/(M*−2)) Σ_{i=2}^{M*−1} |2F0_i − F0_{i−1} − F0_{i+1}| ] / [ (1/M*) Σ_{i=1}^{M*} F0_i ]   (10)
wherein F0_s1 denotes the first-order pitch jitter and F0_s2 denotes the second-order pitch jitter.
In an embodiment of the present invention, among all pairs of adjacent frames of the preprocessed speech signal satisfying F(i) × F(i+1) ≠ 0, the differential pitch dF between the corresponding voiced sounds is:
dF(k) = F(i) − F(i+1), 1 ≤ k ≤ M*, 1 ≤ i ≤ M   (11)
wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
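A sketch of the autocorrelation-based pitch extraction and the differential pitch of formula (11); the 50–500 Hz search range is an assumption (the patent does not state one), and unvoiced frames are represented by a fundamental frequency of 0:

```python
import numpy as np

def pitch_from_autocorr(frame, fs, f0_min=50.0, f0_max=500.0):
    """Fundamental frequency of one frame from the peak of R_n(k)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    k_min, k_max = int(fs / f0_max), min(int(fs / f0_min), len(r) - 1)
    if k_min >= k_max or r[0] <= 0:
        return 0.0   # no usable peak: treat the frame as unvoiced
    k_peak = k_min + int(np.argmax(r[k_min:k_max]))
    # The peak lag k_peak is the pitch period in samples; its reciprocal
    # (scaled by the sampling rate) is the fundamental frequency.
    return fs / k_peak

def differential_pitch(f0):
    """dF between adjacent voiced frames, i.e. pairs with F(i)*F(i+1) != 0."""
    f0 = np.asarray(f0, dtype=float)
    voiced_pairs = (f0[:-1] * f0[1:]) != 0
    return (f0[:-1] - f0[1:])[voiced_pairs]
```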
In an embodiment of the present invention, the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
In a specific implementation, the formant parameters include the formant bandwidths and frequencies; their extraction is based on estimating the spectral envelope of the speech signal, and the formant parameters are estimated from the vocal tract model using the linear prediction (LPC) method. In an embodiment of the present invention, the speech signal is deconvolved with the LPC method, giving the all-pole model of the vocal tract response:
H(z) = 1 / A(z) = 1 / (1 − Σ_{k=1}^{p} a_k z⁻ᵏ)   (12)
Then the roots of the prediction-error filter A(z) are found; for a complex root z_i = r_i e^{jθ_i}, the corresponding formant frequency is:
F_i = θ_i / (2πT)   (13)
wherein T is the sampling period.
Denoting the first, second, and third formant frequencies of the i-th voiced frame as F1_i, F2_i, and F3_i, the second formant frequency ratio is:
F2_i / (F2_i − F1_i)   (14)
When the first, second, and third formant frequencies of the frames have been calculated using the above formulas, the maximum, minimum, mean, variance, and first-order jitter of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio, can be computed over those frames.
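The LPC-based formant estimation of formulas (12) and (13) can be sketched as below; the LPC order of 12 and the plain normal-equation solution are illustrative assumptions:

```python
import numpy as np

def lpc_formants(frame, fs, order=12):
    """First three formant frequency candidates of one voiced frame,
    from the roots of the prediction-error filter A(z)."""
    # Autocorrelation method: solve the normal equations for the
    # coefficients a_1..a_p of the all-pole vocal-tract model 1/A(z)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][: order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])

    # Roots of A(z) = 1 - a_1 z^-1 - ... - a_p z^-p; keep one root of
    # each complex-conjugate pair
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]

    # Each root angle theta_i maps to a formant F_i = theta_i / (2*pi*T),
    # per formula (13)
    T = 1.0 / fs
    freqs = np.sort(np.angle(roots) / (2 * np.pi * T))
    return freqs[:3]   # candidates for the first three formants
```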
In an embodiment of the present invention, in order to improve recognition precision, the nonlinear Hz-Mel relation is corrected in the extraction of the Mel-frequency cepstral coefficients, and two new sets of coefficients, Mid-MFCC and I-MFCC, are introduced. Mid-MFCC and I-MFCC have good computational precision in the middle- and high-frequency regions respectively, and can serve as a supplement to the low-order MFCC, enabling the spectral features of the full frequency domain to be computed. The filters of the Mid-MFCC filter bank are distributed more densely in the middle-frequency region and more sparsely in the low- and high-frequency regions; the I-MFCC are inverse Mel-frequency cepstral coefficients, whose filter bank is distributed sparsely in the low-frequency region and more densely in the high-frequency region.
In an embodiment of the present invention, the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12, which are respectively calculated using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
Finally, the characteristic parameters of the improved MFCC are composed of these 20 orders of Mel cepstral parameters together with the maximum, minimum, mean, and variance of the first-order difference of the Mel cepstral coefficients (MFCC).
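Since the patent's exact Mid-MFCC and I-MFCC warping formulas are not reproduced in this text, the sketch below stands in the standard Mel scale and the inverted-Mel scale from the IMFCC literature (with an assumed 4000 Hz upper band edge); the 6 + 8 + 6 coefficient split is taken from the description, everything else is an assumption:

```python
import numpy as np

def f_mel(f):
    # Standard Mel scale used by the ordinary MFCC filter bank
    return 2595.0 * np.log10(1.0 + f / 700.0)

def f_inv_mel(f, f_max=4000.0):
    # Inverted Mel scale: filters dense at high frequency, sparse at low
    return f_mel(f_max) - f_mel(f_max - f)

def improved_mfcc_stats(cepstra):
    """`cepstra`: (n_frames, 20) matrix of the combined coefficients
    (MFCC orders 1-6, Mid-MFCC orders 3-10, I-MFCC orders 7-12).
    Taking the first-order difference across frames, rather than across
    coefficient orders, is an assumption."""
    delta = np.diff(cepstra, axis=0)
    return np.array([delta.max(), delta.min(), delta.mean(), delta.var()])
```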
Step S104: Compose the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal.
In a specific implementation, once the characteristic parameters of the preprocessed frames have been extracted, they are combined in order into the corresponding feature vectors, thereby obtaining the feature vector sequence corresponding to the speech signal.
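Continuing the sketches above, a single utterance's feature vector can then be assembled by concatenating the parameter groups in a fixed order; the argument names refer to the earlier illustrative functions and are assumptions (each argument is a 1-D array of the statistics from the corresponding group):

```python
import numpy as np

def utterance_features(energy_stats, pitch_stats, formant_stats, mfcc_stats):
    # Fixed concatenation order so that every utterance yields a vector
    # with the same layout for the SVM stage
    return np.concatenate([energy_stats, pitch_stats, formant_stats, mfcc_stats])
```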
Step S105: Train and classify the feature vector sequence corresponding to the speech signal using a support vector machine (SVM), obtaining the corresponding speech emotion recognition result.
In a specific implementation, once the feature vector sequence corresponding to the speech signal has been obtained, it can be trained and classified using a support vector machine (SVM) to obtain the corresponding speech emotion recognition result.
In an embodiment of the present invention, the radial basis function (RBF) is chosen as the kernel of the support vector machine, and the support vector machine classifier used is a five-class classifier in "one-vs-one" mode.
Specifically, in the course of training the support vector machine, five kinds of emotion are recognized; according to the "one-vs-one" strategy, 10 support vector machine classifiers can be built, namely the "anger-fear", "anger-sadness", "anger-neutral", "anger-happiness", "fear-sadness", "fear-neutral", "fear-happiness", "sadness-neutral", "sadness-happiness", and "neutral-happiness" classifiers.
Then, 150 training samples and 50 test samples are provided for each emotion, and the feature vector sequences composed of the characteristic parameters extracted in the above steps are input to the 10 trained support vector machine classifiers.
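A minimal sketch of this classification stage with scikit-learn, whose SVC implements multi-class classification by the same one-vs-one scheme (10 pairwise classifiers for 5 classes); the emotion labels and data layout are illustrative assumptions:

```python
from sklearn.svm import SVC

# Assumed label set matching the five emotions named above
EMOTIONS = ["anger", "fear", "sadness", "neutral", "happiness"]

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """X_*: feature matrices (n_samples, n_features); y_*: emotion labels."""
    clf = SVC(kernel="rbf")   # RBF kernel; multi-class handling is one-vs-one
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)   # mean recognition accuracy
```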
Experimental comparisons of the recognition accuracy of the speech emotion recognition method in the embodiment of the present invention against a speech emotion recognition method in the prior art are shown in Tables 1 and 2 below:
Table 1
Table 2
From the comparison of the above tables, it can be seen that the recognition accuracy of the speech emotion recognition method in the embodiment of the present invention is markedly improved.
The speech emotion recognition method in the embodiment of the present invention having been described in detail above, the device corresponding to the above method is introduced below.
Fig. 2 shows the structure of a speech emotion recognition device in an embodiment of the present invention. Referring to Fig. 2, the device 20 may include an acquiring unit 201, a preprocessing unit 202, a parameter extraction unit 203, and a recognition unit 204, wherein:
The acquiring unit 201 is adapted to obtain the speech signal to be processed.
The preprocessing unit 202 is adapted to preprocess the obtained speech signal to obtain a preprocessed speech signal.
The parameter extraction unit 203 is adapted to extract the characteristic parameters of the preprocessed speech signal and to compose the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC.
The recognition unit 204 is adapted to train and classify the feature vector sequence corresponding to the speech signal using a support vector machine, obtaining the corresponding speech emotion recognition result.
In a specific implementation, the preprocessing unit 202 is adapted to sample and quantize the obtained speech signal and to apply pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
In a specific implementation, the preprocessing unit 202 is adapted to traverse the frames obtained after preprocessing, obtaining the current frame; to calculate the short-time energy of the current frame and of a preset number of consecutive frames following it; when the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, to calculate the ratio of the short-time energy between the current frame and the next frame; and when the calculated ratio is greater than or equal to a preset threshold, to determine the current frame to be the speech start frame of the speech signal.
In an embodiment of the present invention, the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
In an embodiment of the present invention, the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each frame obtained after preprocessing, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0; wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
In an embodiment of the present invention, the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
In an embodiment of the present invention, the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
In an embodiment of the present invention, the parameter extraction unit 203 is adapted to calculate the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 respectively using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
An embodiment of the present invention further provides a computer-readable storage medium on which computer instructions are stored, wherein the computer instructions, when run, execute the steps of the above speech emotion recognition method. For the speech emotion recognition method, refer to the introduction in the preceding sections, which is not repeated here.
An embodiment of the present invention further provides a terminal including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, executes the steps of the above speech emotion recognition method. For the speech emotion recognition method, refer to the introduction in the preceding sections, which is not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic disk, optical disc, and the like.
Although the present disclosure is as above, the present invention is not limited thereto. Any person skilled in the art can make various changes or modifications without departing from the spirit and scope of the invention; therefore, the protection scope of the present invention shall be subject to the scope defined by the claims.
Claims (10)
1. A speech emotion recognition method, characterized by including:
Obtaining a speech signal to be processed;
Preprocessing the obtained speech signal to obtain a preprocessed speech signal;
Extracting characteristic parameters of the preprocessed speech signal; the characteristic parameters include short-time energy and its derived parameters, fundamental frequency and its derived parameters, voice-quality formants and their derived parameters, 20 orders of Mel cepstral coefficients derived from the MFCC, and the maximum, minimum, mean, and variance of the first-order difference of the MFCC;
Composing the extracted characteristic parameters into a corresponding feature vector sequence, obtaining the feature vector sequence corresponding to the speech signal;
Training and classifying the feature vector sequence corresponding to the speech signal using a support vector machine, obtaining the corresponding speech emotion recognition result.
2. The speech emotion recognition method according to claim 1, characterized in that preprocessing the obtained speech signal includes:
Sampling and quantizing the obtained speech signal, followed by pre-emphasis, framing and windowing, short-time energy analysis, and endpoint detection.
3. The speech emotion recognition method according to claim 2, characterized in that, when performing endpoint detection on the speech signal, the speech start frame is determined as follows:
Traversing the frames obtained after preprocessing to obtain the current frame;
Calculating the short-time energy of the current frame and of a preset number of consecutive frames following it;
When the short-time energies of the current frame and of the following preset number of consecutive frames are all greater than or equal to the short-time energy of the initial silent segment of the speech signal, calculating the ratio of the short-time energy between the current frame and the next frame;
When the calculated ratio is greater than or equal to a preset threshold, determining the current frame to be the speech start frame of the speech signal.
4. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the short-time energy of the preprocessed speech signal and its derived parameters include the short-time energy of each frame obtained after preprocessing, the maximum of the short-time energy, the minimum of the short-time energy, the mean of the short-time energy, the variance of the short-time energy, the short-time energy jitter, the linear regression coefficient of the short-time energy, the mean square error of the linear regression coefficient of the short-time energy, and the proportion of the total short-time energy lying below 250 Hz.
5. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the fundamental frequency of the preprocessed speech signal and its derived parameters include the fundamental frequency of each frame obtained after preprocessing, the maximum of the fundamental frequency, the minimum of the fundamental frequency, the mean of the fundamental frequency, the variance of the fundamental frequency, the first-order pitch jitter, the second-order pitch jitter, and the differential pitch between the voiced sounds of adjacent frame pairs satisfying F(i) × F(i+1) ≠ 0; wherein F(i) denotes the fundamental frequency of the i-th frame and F(i+1) denotes the fundamental frequency of the (i+1)-th frame.
6. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the voice-quality formants of the preprocessed speech signal and their derived parameters include the first, second, and third formant frequencies of each voiced frame obtained after preprocessing, the maxima of the first, second, and third formant frequencies, the minima of the first, second, and third formant frequencies, the means of the first, second, and third formant frequencies, the variances and first-order jitters of the first, second, and third formant frequencies, and the maximum, minimum, and mean of the second formant frequency ratio.
7. The speech emotion recognition method according to any one of claims 1 to 3, characterized in that the 20 orders of Mel cepstral coefficients derived from the MFCC include the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12.
8. The speech emotion recognition method according to claim 7, characterized in that the MFCC of orders 1 to 6, the Mid-MFCC of orders 3 to 10, and the I-MFCC of orders 7 to 12 are respectively calculated using the following formulas:
wherein f_Mel denotes the frequency of the MFCC, f_Mid-Mel denotes the frequency of the Mid-MFCC, f_I-Mel denotes the frequency of the I-MFCC, and f denotes the actual frequency.
9. A computer-readable storage medium on which computer instructions are stored, characterized in that the computer instructions, when run, execute the steps of the speech emotion recognition method according to any one of claims 1 to 8.
10. A terminal, characterized by including a memory and a processor, the memory storing computer instructions capable of running on the processor, wherein the processor, when running the computer instructions, executes the steps of the speech emotion recognition method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810455163.7A CN108305639B (en) | 2018-05-11 | 2018-05-11 | Speech emotion recognition method, computer-readable storage medium and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810455163.7A CN108305639B (en) | 2018-05-11 | 2018-05-11 | Speech emotion recognition method, computer-readable storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108305639A (en) | 2018-07-20
CN108305639B CN108305639B (en) | 2021-03-09 |
Family
ID=62846586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810455163.7A Active CN108305639B (en) | 2018-05-11 | 2018-05-11 | Speech emotion recognition method, computer-readable storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108305639B (en) |
- 2018-05-11 CN CN201810455163.7A patent/CN108305639B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9495954B2 (en) * | 2010-08-06 | 2016-11-15 | At&T Intellectual Property I, L.P. | System and method of synthetic voice generation and modification |
CN103578470A (en) * | 2012-08-09 | 2014-02-12 | 安徽科大讯飞信息科技股份有限公司 | Telephone recording data processing method and system |
CN106887241A (en) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of voice signal detection method and device |
CN106448659A (en) * | 2016-12-19 | 2017-02-22 | 广东工业大学 | Speech endpoint detection method based on short-time energy and fractal dimensions |
CN108010515A (en) * | 2017-11-21 | 2018-05-08 | 清华大学 | A kind of speech terminals detection and awakening method and device |
Non-Patent Citations (2)
Title |
---|
ZHAO Li et al.: "Several Key Technologies in Practical Speech Emotion Recognition", Journal of Data Acquisition and Processing *
HAN Yi et al.: "Speech Emotion Recognition Based on MFCC", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243492A (en) * | 2018-10-28 | 2019-01-18 | 国家计算机网络与信息安全管理中心 | A kind of speech emotion recognition system and recognition methods |
CN109300339A (en) * | 2018-11-19 | 2019-02-01 | 王泓懿 | A kind of exercising method and system of Oral English Practice |
CN109946055A (en) * | 2019-03-22 | 2019-06-28 | 武汉源海博创科技有限公司 | A kind of sliding rail of automobile seat abnormal sound detection method and system |
CN109946055B (en) * | 2019-03-22 | 2021-01-12 | 宁波慧声智创科技有限公司 | Method and system for detecting abnormal sound of automobile seat slide rail |
CN110491417A (en) * | 2019-08-09 | 2019-11-22 | 北京影谱科技股份有限公司 | Speech-emotion recognition method and device based on deep learning |
CN111243627A (en) * | 2020-01-13 | 2020-06-05 | 云知声智能科技股份有限公司 | Voice emotion recognition method and device |
CN111243627B (en) * | 2020-01-13 | 2022-09-27 | 云知声智能科技股份有限公司 | Voice emotion recognition method and device |
CN111768800A (en) * | 2020-06-23 | 2020-10-13 | 中兴通讯股份有限公司 | Voice signal processing method, apparatus and storage medium |
CN113807249A (en) * | 2021-09-17 | 2021-12-17 | 广州大学 | Multi-mode feature fusion based emotion recognition method, system, device and medium |
CN113807249B (en) * | 2021-09-17 | 2024-01-12 | 广州大学 | Emotion recognition method, system, device and medium based on multi-mode feature fusion |
Also Published As
Publication number | Publication date |
---|---|
CN108305639B (en) | 2021-03-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108305639A (en) | Speech-emotion recognition method, computer readable storage medium, terminal | |
CN108682432A (en) | Speech emotion recognition device | |
CN107610715B (en) | Similarity calculation method based on multiple sound characteristics | |
Kulmer et al. | Phase estimation in single channel speech enhancement using phase decomposition | |
CN106486131A (en) | A kind of method and device of speech de-noising | |
CN104700843A (en) | Method and device for identifying ages | |
Hui et al. | Convolutional maxout neural networks for speech separation | |
CN109448726A (en) | A kind of method of adjustment and system of voice control accuracy rate | |
Mittal et al. | Study of characteristics of aperiodicity in Noh voices | |
Yuan | A time–frequency smoothing neural network for speech enhancement | |
CN110136709A (en) | Audio recognition method and video conferencing system based on speech recognition | |
Jiao et al. | Convex weighting criteria for speaking rate estimation | |
CN108288465A (en) | Intelligent sound cuts the method for axis, information data processing terminal, computer program | |
Archana et al. | Gender identification and performance analysis of speech signals | |
Morales-Cordovilla et al. | Feature extraction based on pitch-synchronous averaging for robust speech recognition | |
CN108269574A (en) | Voice signal processing method and device, storage medium and electronic equipment | |
Hanilçi et al. | Comparing spectrum estimators in speaker verification under additive noise degradation | |
Sinha et al. | On the use of pitch normalization for improving children's speech recognition | |
Huang et al. | DNN-based speech enhancement using MBE model | |
CN106356076A (en) | Method and device for detecting voice activity on basis of artificial intelligence | |
CN111755029B (en) | Voice processing method, device, storage medium and electronic equipment | |
Lu | Reduction of musical residual noise using block-and-directional-median filter adapted by harmonic properties | |
Jameel et al. | Noise robust formant frequency estimation method based on spectral model of repeated autocorrelation of speech | |
Thirumuru et al. | Application of non-negative frequency-weighted energy operator for vowel region detection | |
Shome et al. | Non-negative frequency-weighted energy-based speech quality estimation for different modes and quality of speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |