CN100543840C - Speaker recognition method based on emotion transfer rules and speech modification

Speaker recognition method based on emotion transfer rules and speech modification

Info

Publication number
CN100543840C
Authority
CN (China)
Prior art keywords
voice, statement, fundamental frequency, emotion, neutral
Legal status
Expired - Fee Related
Application number
CNB2005100619525A
Filing date
2005-12-13
Other languages
Chinese (zh)
Other versions
CN1787074A (en)
Inventors
Wu Chaohui (吴朝晖), Yang Yingchun (杨莹春), Li Dongdong (李东东)
Current Assignee
Zhejiang University (ZJU)
Original Assignee
Zhejiang University (ZJU)
Application filed by Zhejiang University ZJU
Priority to CNB2005100619525A
Publication of CN1787074A; application granted; publication of CN100543840C

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a speaker recognition method based on emotion transfer rules and speech modification. Neutral speech and emotional speech sharing the same text are first processed to extract the speech features that reflect emotional information; these features are analyzed and compared, and the feature parameters of the collected neutral speech are then modified according to the observed variation patterns of these features. When the emotional state of the test speech is not neutral, a speech model carrying the corresponding emotional information can be selected for the comparison. The beneficial effect of the invention is that, by combining speech feature modification and speech synthesis, the emotional states of the enrollment speech and the test speech are made consistent, which improves the performance of the speaker recognition system.

Description

Speaker recognition method based on emotion transfer rules and speech modification
Technical field
The present invention relates to the fields of signal processing and pattern recognition, and in particular to a speaker recognition method based on emotion transfer rules and speech modification.
Background technology
With the arrival of the 21st century and the rapid development of biotechnology and information technology, biometric authentication has begun to stand out in the era of global electronic commerce as a more convenient and advanced information security technology. Voiceprint recognition is one such technology: it automatically identifies a speaker from speech parameters in the waveform that reflect the speaker's physiological and behavioral characteristics.
Compared with other biometric technologies, voiceprint recognition, i.e., speaker recognition, is contact-free, readily accepted, easy to use, economical, and accurate, and is well suited to remote applications. In practice, however, its performance is affected not only by external noise but also by changes in the speaker's own state (such as emotion), which influence the match between the enrollment and test utterances. A robust voiceprint recognition system should therefore consider the speaker's physiological and behavioral characteristics together. Voiceprint feature extraction should capture not only the physiological characteristics in the speech signal but also the affective characteristics, so that the whole recognition system works from the combination of physiology and behavior and eliminates the instability of recognition performance caused by emotional variation.
Existing emotional speaker recognition systems add, on top of the conventional speaker models built from neutral speech, models trained on the speaker's speech under various emotional states, in order to eliminate the influence of emotional variation.
This emotional-speech modeling approach requires the user to provide emotional speech at the same time as neutral speech during enrollment. Such deliberate emotional expression is often hard for users to accept and undermines the user-friendliness of speaker recognition.
Summary of the invention
The present invention aims to overcome the above drawbacks of the prior art by providing a speaker recognition method based on emotion transfer rules and speech modification. By analyzing speech features under different emotional states, the neutral speech is modified to enrich its emotional information, generating intermediate-state speech that carries emotional information, so that the emotional states at enrollment and at test are consistent, thereby improving speaker recognition performance.
The technical solution adopted by the invention is as follows. Neutral speech and emotional speech sharing the same text are first processed to extract the speech features that reflect emotional information; these features are analyzed and compared, and the feature parameters of the collected neutral speech are then modified according to the variation patterns of these features. When the emotional state of the test speech is not neutral, a speech model carrying the corresponding emotional information can be selected for the comparison.
The technical solution can be further refined. The affective features compared are the mean fundamental frequency, the fundamental frequency range, the utterance duration, the mean intensity, and the intensity range. The feature parameters of the neutral speech to be modified are, after framing the audio, the linear predictive coding coefficients and residual information obtained by linear predictive coding analysis of each frame, together with the speech intensity. The intermediate-state speech carrying emotional information is the speech obtained by linear predictive synthesis from the neutral speech feature parameters modified according to the affective features. The speaker model is a Gaussian mixture model built on the Mel cepstral coefficients extracted from the intermediate-state speech carrying emotional information.
The beneficial effect of the invention is that, by combining speech feature modification and speech synthesis, the emotional states of the enrollment speech and the test speech are made consistent, improving the performance of the speaker recognition system.
Description of drawings
Fig. 1 is the framework diagram of the emotion-variation-robust speaker recognition system based on emotion transfer rules and speech modification according to the present invention;
Fig. 2 is the flow chart of the speech modification algorithm of the present invention.
Embodiment
The invention is further described below with reference to the drawings and embodiments. The method of the present invention is divided into five steps.
Step 1: Audio preprocessing
Audio preprocessing consists of four parts: sampling and quantization, DC offset removal, pre-emphasis, and windowing.
1. Sampling and quantization
a) Filter the audio signal with a sharp-cutoff filter so that its Nyquist frequency F_N is 4 kHz;
b) Set the audio sampling rate F = 2 F_N;
c) Sample the analog audio signal s_a(t) at this rate to obtain the amplitude sequence of the digital audio signal s(n) = s_a(n/F);
d) Quantize s(n) by pulse code modulation (PCM) to obtain the quantized amplitude sequence s'(n).
2. DC offset removal
a) Compute the mean of the quantized amplitude sequence;
b) Subtract this mean from every amplitude to obtain the zero-mean amplitude sequence s''(n).
3. Pre-emphasis
a) Set the pre-emphasis coefficient α of the digital filter with z-transfer function H(z) = 1 - α z^{-1}; α is taken as 1 or a value slightly smaller than 1;
b) Pass s''(n) through the digital filter to obtain the amplitude sequence s'''(n), in which the high-, mid-, and low-frequency components of the audio signal are balanced.
4. Windowing
a) Compute the frame length N (32 ms) and frame shift T (10 ms) of the audio frames, in samples, satisfying
N / F = 0.032
T / F = 0.010
where F is the audio sampling rate in Hz;
b) With frame length N and frame shift T, divide s'''(n) into a series of audio frames F_m, each containing N audio samples;
c) Compute the Hamming window function:
ω(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
d) Apply the Hamming window to each audio frame F_m:
ω(n) × F_m(n) ⇒ {F'_m(n) | n = 0, 1, ..., N - 1}.
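As a non-limiting illustration of this preprocessing step, the following Python/NumPy sketch performs DC offset removal, pre-emphasis, and Hamming-windowed framing at the 8 kHz sampling rate given above; the pre-emphasis value 0.97 and the function name are illustrative choices, not taken from the patent.

```python
import numpy as np

def preprocess(signal, fs=8000, alpha=0.97, frame_ms=0.032, shift_ms=0.010):
    """DC offset removal, pre-emphasis H(z) = 1 - alpha*z^-1, Hamming-windowed framing.

    Assumes the signal is at least one frame long.
    """
    s = signal - np.mean(signal)                      # remove the DC offset (zero-mean sequence)
    s = np.append(s[0], s[1:] - alpha * s[:-1])       # pre-emphasis filter
    N = int(round(frame_ms * fs))                     # frame length in samples (256 at 8 kHz)
    T = int(round(shift_ms * fs))                     # frame shift in samples (80 at 8 kHz)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming window
    n_frames = 1 + max(0, (len(s) - N) // T)
    return np.stack([s[m * T: m * T + N] * window for m in range(n_frames)])
```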
Step 2: Emotional speech feature extraction
Feature extraction on the speech frames comprises the extraction of the fundamental frequency (pitch), the linear predictive coding coefficients, and the residual signal.
1. Pitch extraction:
a) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
b) Set the admissible range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
c) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
d) Compute the subharmonic-to-harmonic ratio (SHR) of each candidate frequency f:
SHR = SS / SH
where SS = Σ_{n=1}^{N} X((n - 1/2) f), SH = Σ_{n=1}^{N} X(nf), and N = f_ceiling / f;
e) Find the frequency f_1 at which the SHR is highest;
f) If f_1 > f_max, or if SS - SH < 0 at f_1, the frame is regarded as non-speech or silence and has no fundamental frequency: Pitch = 0;
g) Search the interval [1.9375 f_1, 2.0625 f_1] for the frequency f_2 at which the SHR reaches a local maximum;
h) If f_2 > f_max, or the SHR at f_2 is below 0.2, then Pitch = f_1;
i) Otherwise, Pitch = f_2;
j) Verify the obtained pitch by autocorrelation: starting from the midpoint of the frame, take a segment one pitch period (1/Pitch) long on each side and compute their autocorrelation value C; if C < 0.2 the pitch value is considered unreliable and Pitch = 0;
k) Finally, apply median smoothing filtering to the whole Pitch sequence.
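A minimal sketch of the SHR-based pitch decision for one frame is given below (Python/NumPy). It assumes mag_spectrum is the one-sided magnitude spectrum of the windowed frame; the linear interpolation used to read the spectrum off-grid and the 1 Hz candidate grid are illustrative simplifications, while the 50-1250 Hz search range, the 550 Hz ceiling, and the 0.2 threshold follow the text above.

```python
import numpy as np

def shr_pitch(mag_spectrum, fs=8000, f_floor=50.0, f_ceiling=1250.0, f_max=550.0):
    """Frame-level pitch estimate based on the subharmonic-to-harmonic ratio (SHR)."""
    freqs = np.linspace(0.0, fs / 2.0, len(mag_spectrum))

    def spectrum_at(f):
        return np.interp(f, freqs, mag_spectrum)      # read |X| at an arbitrary frequency

    def shr(f):
        n = np.arange(1, int(f_ceiling // f) + 1)
        ss = np.sum(spectrum_at((n - 0.5) * f))       # energy at subharmonic locations
        sh = np.sum(spectrum_at(n * f))               # energy at harmonic locations
        return ss / (sh + 1e-12), ss - sh

    candidates = np.arange(f_floor, f_ceiling, 1.0)
    ratios = np.array([shr(f)[0] for f in candidates])
    f1 = candidates[np.argmax(ratios)]
    if f1 > f_max or shr(f1)[1] < 0:                  # non-speech or silent frame
        return 0.0
    mask = (candidates >= 1.9375 * f1) & (candidates <= 2.0625 * f1)
    if not np.any(mask):
        return f1
    idx = np.where(mask)[0]
    j = idx[np.argmax(ratios[idx])]                   # local SHR maximum near 2*f1
    if candidates[j] > f_max or ratios[j] < 0.2:
        return f1
    return candidates[j]
```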
2. Linear predictive coding coefficients (LPC):
a) Set the order p of the linear predictive coding (LPC) analysis;
b) Compute the p-th order LPC coefficients {a_i} (i = 1, 2, ..., p) by the recursion:
R_i = Σ_{n=i}^{N-1} s(n) s(n - i)
E_0 = R_0
k_i = (R_i - Σ_{j=1}^{i-1} a_j^{(i-1)} R_{i-j}) / E_{i-1}
a_i^{(i)} = k_i
a_j^{(i)} = a_j^{(i-1)} + k_i a_{i-j}^{(i-1)},  1 ≤ j ≤ i - 1
E_i = (1 - k_i^2) E_{i-1}
for i = 1, 2, ..., p, and finally
a_j = a_j^{(p)},  1 ≤ j ≤ p.
This yields {a_i}, where R_i is the autocorrelation function.
3. Residual signal:
u(n) = [s(n) - Σ_{i=1}^{p} a_i s(n - i)] / G
where G is the gain factor.
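The recursion in 2 b) is the Levinson-Durbin solution of the autocorrelation equations. The sketch below (Python/NumPy) implements the textbook form of that recursion, using the common sign convention a_j^{(i)} = a_j^{(i-1)} - k_i a_{i-j}^{(i-1)} together with the k_i definition given above, and then inverse-filters the frame to obtain the residual; taking G as the square root of the final residual energy is an illustrative assumption.

```python
import numpy as np

def lpc_levinson(frame, p):
    """LPC coefficients a_1..a_p and gain G via the Levinson-Durbin recursion."""
    N = len(frame)
    R = np.array([np.dot(frame[i:], frame[:N - i]) for i in range(p + 1)])  # R_0..R_p
    a = np.zeros(p + 1)
    E = R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E   # reflection (PARCOR) coefficient k_i
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]          # order update of the predictor
        a = a_new
        E *= (1.0 - k * k)                               # residual energy E_i
    G = np.sqrt(max(E, 1e-12))                           # gain factor (illustrative choice)
    return a[1:], G

def lpc_residual(frame, a, G):
    """Residual u(n) = (s(n) - sum_i a_i s(n-i)) / G."""
    p = len(a)
    pred = np.zeros(len(frame))
    for i in range(1, p + 1):
        pred[i:] += a[i - 1] * frame[:-i]
    return (frame - pred) / G
```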
Step 3: Affective feature analysis
Affective feature analysis covers the mean fundamental frequency, the fundamental frequency range, the utterance duration, the mean intensity, and the intensity range.
1. Mean fundamental frequency and its variation
a) Calculation of the mean fundamental frequency:
P_mean = (Σ_{i=1}^{f} P_i) / f
where P_mean is the mean fundamental frequency of an utterance, P_i is the pitch value of frame i, and f is the number of speech frames in the utterance;
b) The variation pattern of the mean fundamental frequency is the difference between the mean fundamental frequencies of the emotional speech and of the neutral speech:
AP = P_mean-e - P_mean-n
where AP is the variation of the mean fundamental frequency, and P_mean-e and P_mean-n are the mean fundamental frequencies of the emotional utterance and of the corresponding neutral utterance, respectively.
2. Fundamental frequency range and its variation
a) Calculation of the fundamental frequency range:
R = P_max - P_min
where R is the fundamental frequency range of an utterance, and P_max and P_min are the maximum and minimum pitch values in the utterance;
b) The variation pattern of the fundamental frequency range is the ratio of the emotional range to the neutral range:
PR = R_e / R_n
where PR is the variation of the fundamental frequency range, and R_e and R_n are the fundamental frequency ranges of the emotional utterance and of the corresponding neutral utterance, respectively.
3. Utterance duration and its variation
a) The utterance duration is the length of each utterance from start to end. The start and end points are located by comparing the speech energy with a predefined energy threshold: an utterance starts when the speech energy exceeds the threshold and remains above it for the following several consecutive frames, and ends when the speech energy remains below the threshold for several consecutive frames. Under this definition, the duration of an utterance is measured by the number of frames determined in Step 1;
b) The variation of the utterance duration is the ratio of the emotional utterance duration to the corresponding neutral utterance duration:
D = f_e / f_n
where D is the variation of the utterance duration, and f_e and f_n are the numbers of speech frames of the emotional utterance and of the corresponding neutral utterance, respectively.
4. Mean intensity and its variation
a) Calculation of the mean intensity:
T_mean = (Σ_{i=1}^{K} T_i) / K
where T_mean is the mean intensity of an utterance, T_i is the value of sample i, and K is the number of samples in the utterance;
b) The variation pattern of the mean intensity is the difference between the mean intensities of the emotional speech and of the neutral speech:
AT = T_mean-e - T_mean-n
where AT is the variation of the mean intensity, and T_mean-e and T_mean-n are the mean intensities of the emotional utterance and of the corresponding neutral utterance, respectively.
5. Intensity range and its variation
a) Calculation of the intensity range:
TR = T_max - T_min
where TR is the intensity range of an utterance, and T_max and T_min are the maximum and minimum intensity values in the utterance;
b) The variation pattern of the intensity range is the ratio of the emotional range to the neutral range:
TRC = TR_e / TR_n
where TRC is the variation of the intensity range, and TR_e and TR_n are the intensity ranges of the emotional utterance and of the corresponding neutral utterance, respectively.
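Collecting the five transfer rules, the following sketch computes AP, PR, D, AT, and TRC for one text-matched neutral/emotional utterance pair. It assumes per-frame pitch values and raw sample amplitudes are available, and it takes the absolute sample amplitude as the intensity value, which is one reading of "the value of each sample" above; the function name is illustrative.

```python
import numpy as np

def emotion_transfer_rules(neutral_pitch, emo_pitch, neutral_samples, emo_samples):
    """Variation patterns between a neutral utterance and the emotional utterance of the same text."""
    AP = np.mean(emo_pitch) - np.mean(neutral_pitch)          # mean fundamental frequency difference
    PR = np.ptp(emo_pitch) / np.ptp(neutral_pitch)            # fundamental frequency range ratio
    D = len(emo_pitch) / len(neutral_pitch)                   # duration (frame count) ratio f_e / f_n
    n_int, e_int = np.abs(neutral_samples), np.abs(emo_samples)
    AT = np.mean(e_int) - np.mean(n_int)                      # mean intensity difference
    TRC = np.ptp(e_int) / np.ptp(n_int)                       # intensity range ratio
    return AP, PR, D, AT, TRC
```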
Step 4: Speech modification and intermediate-state speech synthesis
Once the variation patterns between the neutral speech and the emotional speech have been obtained, the parameters of the neutral speech can be modified according to this variation information, and the new parameters are used to obtain the intermediate-state speech carrying emotional information.
1. Modify the duration of the neutral speech
The speech duration is changed by adding frames to, or removing frames from, the neutral speech.
Round the value of D to obtain K. If D > 1, the emotional variation lengthens the speech, so the number of speech frames must increase: to imitate the emotional speech, in every group of K frames the K-th frame is duplicated as the (K+1)-th frame, and the original (K+1)-th frame is shifted back to become the (K+2)-th frame. If D < 1, the emotional variation shortens the speech, so the number of speech frames must decrease: the last (K-th) frame of every group of K frames is deleted, and the original (K+1)-th frame becomes the K-th frame. A sketch of this rule is given below.
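As a non-limiting illustration, the sketch below applies the frame duplication/deletion rule to a list of frames, given the group size K obtained by rounding as described above:

```python
def modify_duration(frames, D, K):
    """Lengthen (D > 1) or shorten (D < 1) an utterance by one frame per group of K frames.

    frames : list of per-frame sample arrays
    D      : duration variation ratio f_e / f_n
    K      : group size obtained by rounding as described in the text (K >= 1)
    """
    out = []
    for start in range(0, len(frames), K):
        group = list(frames[start:start + K])
        if D > 1 and len(group) == K:
            group.append(group[-1])        # duplicate the K-th frame as the new (K+1)-th frame
        elif D < 1 and len(group) == K:
            group = group[:-1]             # delete the K-th frame of the group
        out.extend(group)
    return out
```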
2. Modify the fundamental frequency of the neutral speech
u_m = (u + AP) × PR
where u_m is the modified fundamental frequency of the neutral speech, u is the fundamental frequency of the duration-modified neutral speech, AP is the variation of the mean fundamental frequency, and PR is the variation of the fundamental frequency range.
3. Synthesize the intermediate-state speech
Similarly, using the modified linear predictive coding coefficients and residual information, the intermediate-state speech is obtained by linear predictive synthesis:
s(n) = G u_m(n) + Σ_{i=1}^{p} a_i s(n - i)
4. Modify the intensity of the intermediate-state speech
Finally, the intensity of the speech obtained by linear predictive synthesis is modified, yielding the intermediate-state speech carrying emotional information:
T_m = (T + AT) × TRC
where T_m is the modified speech intensity, T is the intensity of the speech obtained by linear predictive synthesis, AT is the variation of the mean intensity, and TRC is the variation of the intensity range.
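A minimal sketch of these three sub-steps follows. Applying the pitch rule to a per-frame pitch contour and applying the intensity rule as a global rescaling of the synthesized samples are illustrative simplifications of the modification described above; the all-pole synthesis follows s(n) = G·u_m(n) + Σ a_i s(n-i).

```python
import numpy as np

def modify_pitch(pitch_contour, AP, PR):
    """Pitch transfer rule u_m = (u + AP) * PR applied to a per-frame pitch contour."""
    return (np.asarray(pitch_contour) + AP) * PR

def lpc_synthesize(excitation, a, G):
    """All-pole synthesis s(n) = G*u_m(n) + sum_i a_i * s(n-i) for one frame."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        s[n] = G * excitation[n] + sum(a[i] * s[n - 1 - i] for i in range(min(p, n)))
    return s

def modify_intensity(samples, AT, TRC):
    """Intensity transfer rule T_m = (T + AT) * TRC, applied as a global gain."""
    T = np.mean(np.abs(samples))          # current mean intensity of the synthesized speech
    T_m = (T + AT) * TRC                  # target mean intensity
    return samples * (T_m / (T + 1e-12))
```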
Step 5: Speaker identification
After the intermediate-state speech carrying emotional information has been obtained, Mel cepstral features are extracted from it and Gaussian mixture models (GMM) are used for speaker identification. One Gaussian mixture model is built for each user, so each user's model parameters must be trained. The input speech signals (the intermediate-state speech and the test speech) first undergo feature extraction. Speaker identification is divided into three parts: feature extraction, model training, and identification.
1. MFCC extraction:
a) Set the order p of the Mel cepstral coefficients;
b) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
c) Compute the Mel-domain scale:
M_i = (i / p) × 2595 log10(1 + (8000 / 2.0) / 700.0), (i = 0, 1, 2, ..., p)
d) Compute the corresponding frequency-domain scale:
f_i = 700 × (10^{M_i / 2595} - 1), (i = 0, 1, 2, ..., p)
e) Compute the log energy spectrum on each Mel-domain channel φ_j:
E_j = Σ_{k=0}^{K/2-1} φ_j(k) |X(k)|²
where Σ_{k=0}^{K/2-1} φ_j(k) = 1;
f) Apply the discrete cosine transform (DCT).
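The sketch below walks through steps a) to f) for one windowed frame. The triangular construction of the Mel channels is a common textbook variant and an assumption here (the text above only requires each channel's weights to sum to 1); the channel count p + 2 is likewise illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, fs=8000, p=13, n_fft=256):
    """Mel-frequency cepstral coefficients of one windowed frame (illustrative)."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # |X(k)|^2, k = 0..n_fft/2
    n_chan = p + 2                                            # number of Mel channels (a common choice)
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)     # Mel scale at the Nyquist frequency
    mel_pts = np.linspace(0.0, mel_max, n_chan + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)       # back to the frequency-domain scale
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_chan, n_fft // 2 + 1))
    for j in range(1, n_chan + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):
            fbank[j - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[j - 1, k] = (right - k) / max(right - center, 1)
    fbank /= np.maximum(fbank.sum(axis=1, keepdims=True), 1e-12)   # channel weights sum to 1
    E = np.log(np.maximum(fbank @ power, 1e-12))                   # log energy per Mel channel
    return dct(E, type=2, norm='ortho')[:p]                        # DCT -> first p cepstral coefficients
```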
2. GMM model training
Each speaker's speech features form a specific distribution in feature space, and this distribution can be used to describe the speaker's individuality. A Gaussian mixture model (GMM) approximates the speaker's feature distribution with a linear combination of several Gaussian distributions.
The probability density function of every speaker has the same functional form; only the parameters of the function differ. An M-order Gaussian mixture model describes the distribution of the frame features in feature space with a linear combination of M single Gaussian distributions, that is:
p(x) = Σ_{i=1}^{M} P_i b_i(x)
b_i(x) = N(x; u_i, R_i) = 1 / ((2π)^{p/2} |R_i|^{1/2}) exp{-(1/2)(x - u_i)^T R_i^{-1} (x - u_i)}
where p is the dimension of the feature vector, b_i(x) is the kernel function, a Gaussian distribution with mean vector u_i and covariance matrix R_i, and M (typically 16 or 32) is the order of the GMM, fixed as an integer before the speaker models are built. λ = {P_i, u_i, R_i | i = 1, 2, ..., M} are the parameters of the speaker feature distribution GMM. As the weighting coefficients of the Gaussian mixture, the P_i must satisfy
∫_{-∞}^{+∞} p(x | λ) dx = 1.
Because evaluating p(x) in the GMM requires inverting the p × p matrices R_i (i = 1, 2, ..., M), the computational load is high. The R_i are therefore set to diagonal matrices, which turns the matrix inversion into element-wise reciprocals and speeds up the computation.
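For the training step, the diagonal-covariance GMM can be estimated with a standard EM implementation. The sketch below uses scikit-learn's GaussianMixture (an assumed tool, not one named by the patent) with M = 32, one of the orders mentioned above:

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, M=32):
    """Fit an M-component diagonal-covariance GMM to one speaker's MFCC frames.

    features : array of shape (n_frames, p), MFCC vectors of the intermediate-state speech
    """
    gmm = GaussianMixture(n_components=M,
                          covariance_type='diag',   # diagonal R_i: inversion becomes reciprocals
                          max_iter=200,
                          random_state=0)
    gmm.fit(features)
    return gmm
```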
3. Identification
After the user's speech is input, feature extraction yields a sequence of feature vectors. This sequence is scored against the GMM built from each enrolled user's model parameters to obtain a similarity value s. The user whose GMM produces the largest value of s is output as the identified speaker.
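Identification then reduces to scoring the test feature sequence against every enrolled model and taking the argmax. In the sketch below, the scikit-learn score method (average per-frame log-likelihood) stands in for the similarity value s; this is an assumption, since the text does not fix a particular likelihood normalization.

```python
def identify_speaker(test_features, speaker_gmms):
    """Return the enrolled speaker whose GMM gives the highest similarity to the test features.

    speaker_gmms : dict mapping speaker id -> trained GaussianMixture
    """
    scores = {spk: gmm.score(test_features) for spk, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get), scores
```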
Experimental results
The system was tested on the Emotional Prosody Speech corpus. This corpus is an emotional speech database built to the standards of the Linguistic Data Consortium for research on the pronunciation characteristics of speech under different emotions. It was recorded by 7 professional actors (3 male and 4 female target speakers), who read aloud a series of given English utterances, mainly dates and numbers, covering 14 different emotion types. The recordings were made by having the actors act out each emotion with the corresponding tone, intonation, and speaking rate. The recording length per emotion differs from speaker to speaker, roughly between 10 and 40 seconds, with only a few reaching about 50 seconds, and the total recording length per speaker is about 5 to 6 minutes.
On the same corpus we also ran the same experiment with the traditional speaker recognition method (Baseline) and with a speaker recognition method that adds linear predictive coding analysis and synthesis but performs no feature modification (Unmodified LPC), for comparison with our system (Modified LPC). Both of these methods model the speaker using neutral speech only and use no prior knowledge of any emotion.
The traditional speaker recognition method without any extra processing is based on the preprocessing and speaker identification steps of this description. After the neutral speech is preprocessed, its Mel cepstral features are extracted and the speaker is modeled with a Gaussian mixture model. Likewise, the emotional test speech is preprocessed, its Mel cepstral features are extracted, and it is compared against the trained speaker models; the person corresponding to the highest-scoring model is taken as the recognized speaker.
The method that adds linear predictive coding analysis and synthesis without feature modification builds on the traditional method: after the speech is preprocessed, LPC analysis is performed and the speech is resynthesized from the unmodified LPC coefficients and residual signal obtained from the analysis; the Mel cepstral features are then extracted from the resynthesized speech and the speaker is modeled with a Gaussian mixture model. Likewise, the emotional test speech goes through the additional LPC analysis and synthesis after preprocessing.
We evaluated the recognition results of the 14 kinds of emotional test speech against the speaker models built from neutral speech.
The experimental results are as follows:
[Results table reproduced as an image in the original publication: identification rates (IR) of the Baseline, Unmodified LPC, and Modified LPC methods on the 14 emotion types.]
Here "Baseline" denotes the traditional speaker recognition method, "Unmodified LPC" denotes the method that adds linear predictive coding analysis and synthesis without feature modification, and "Modified LPC" denotes the method proposed by this system. "IR" denotes the speaker identification rate: assuming the user is legitimate, the claimant is compared against all enrolled users in the database and the most similar user is returned; the identification is correct if the returned user and the claimant are the same person.
The experimental results show that the traditional speaker recognition method without any extra processing achieves a good identification rate when the emotional states of the enrollment speech and the test speech are consistent (i.e., when both are neutral speech), but its performance drops sharply when the test emotion changes.
Performing linear predictive coding analysis and synthesis before extracting the Mel cepstral features does not lose the speaker's distinctive characteristics.
By analyzing emotional speech and neutral speech with identical text, this recognition algorithm modifies the neutral speech according to the speech variation patterns and adds emotional information to the neutral speech. The algorithm works from prior knowledge of the emotions, does not require emotional speech from the specific speaker, and improves the speaker identification rate, thereby strengthening the robustness of the speaker recognition system when the speaker's emotion fluctuates.

Claims (10)

1. A speaker recognition method based on emotion transfer rules and speech modification, characterized in that: neutral speech and emotional speech having the same text are first processed to extract the speech features that reflect emotional information; these features are analyzed and compared; the feature parameters of the collected neutral speech are then modified according to the variation patterns of these features, yielding intermediate-state speech carrying emotional information; and when the emotional state of the test speech is not neutral, a speaker model built from the intermediate-state speech carrying the corresponding emotional information is used for the comparison.
2. The speaker recognition method based on emotion transfer rules and speech modification according to claim 1, characterized in that: the feature parameters of the neutral speech to be modified are the linear predictive coding coefficients and the residual information obtained by linear predictive coding analysis after framing the audio, together with the speech intensity.
3. The speaker recognition method based on emotion transfer rules and speech modification according to claim 1, characterized in that: the intermediate-state speech carrying emotional information is the speech obtained by linear predictive synthesis from the neutral speech feature parameters modified according to the affective features.
4. The speaker recognition method based on emotion transfer rules and speech modification according to claim 1, characterized in that: the speaker model is the model obtained by modeling, with a Gaussian mixture model, the Mel cepstral coefficients extracted from the intermediate-state speech carrying emotional information.
5. The speaker recognition method based on emotion transfer rules and speech modification according to claim 1, 2, 3, or 4, characterized in that the main steps of the method are:
5.1) Audio preprocessing: the audio preprocessing is divided into four parts: sampling and quantization, DC offset removal, pre-emphasis, and windowing;
5.2) Emotional speech feature extraction: the feature extraction on the speech frames comprises the extraction of the fundamental frequency, the linear predictive coding coefficients, and the residual signal;
5.3) Affective feature analysis: covering the mean fundamental frequency, the fundamental frequency range, the utterance duration, the mean intensity, and the intensity range;
5.3.1) Mean fundamental frequency and its variation
a) Calculation of the mean fundamental frequency:
P_mean = (Σ_{i=1}^{f} P_i) / f
where P_mean is the mean fundamental frequency of an utterance, P_i is the pitch value of frame i, and f is the number of speech frames in the utterance;
b) The variation pattern of the mean fundamental frequency is the difference between the mean fundamental frequencies of the emotional speech and of the neutral speech:
AP = P_mean-e - P_mean-n
where AP is the variation of the mean fundamental frequency, and P_mean-e and P_mean-n are the mean fundamental frequencies of the emotional utterance and of the corresponding neutral utterance, respectively;
5.3.2) Fundamental frequency range and its variation
a) Calculation of the fundamental frequency range:
R = P_max - P_min
where R is the fundamental frequency range of an utterance, and P_max and P_min are the maximum and minimum pitch values in the utterance;
b) The variation pattern of the fundamental frequency range is the ratio of the emotional range to the neutral range:
PR = R_e / R_n
where PR is the variation of the fundamental frequency range, and R_e and R_n are the fundamental frequency ranges of the emotional utterance and of the corresponding neutral utterance, respectively;
5.3.3) Utterance duration and its variation
a) The utterance duration is the length of each utterance from start to end; it is determined by the number of frames of each utterance calculated in step 5.1);
b) The variation of the utterance duration is obtained as the ratio of the emotional utterance duration to the corresponding neutral utterance duration: D = f_e / f_n
where D is the variation of the utterance duration, and f_e and f_n are the numbers of speech frames of the emotional utterance and of the corresponding neutral utterance, respectively;
5.3.4) Mean intensity and its variation
a) Calculation of the mean intensity:
T_mean = (Σ_{i=1}^{K} T_i) / K
where T_mean is the mean intensity of an utterance, T_i is the value of sample i, and K is the number of samples in the utterance;
b) The variation pattern of the mean intensity is the difference between the mean intensities of the emotional speech and of the neutral speech:
AT = T_mean-e - T_mean-n
where AT is the variation of the mean intensity, and T_mean-e and T_mean-n are the mean intensities of the emotional utterance and of the corresponding neutral utterance, respectively;
5.3.5) Intensity range and its variation
a) Calculation of the intensity range:
TR = T_max - T_min
where TR is the intensity range of an utterance, and T_max and T_min are the maximum and minimum intensity values in the utterance;
b) The variation pattern of the intensity range is the ratio of the emotional range to the neutral range:
TRC = TR_e / TR_n
where TRC is the variation of the intensity range, and TR_e and TR_n are the intensity ranges of the emotional utterance and of the corresponding neutral utterance, respectively;
5.4) Speech modification and intermediate-state speech synthesis:
once the variation patterns between the neutral speech and the emotional speech have been obtained, the parameters of the neutral speech are modified according to this variation information, and the new parameters are used to obtain the intermediate-state speech carrying emotional information;
5.5) Speaker identification:
after the intermediate-state speech carrying emotional information has been obtained, Mel cepstral features are extracted from it and Gaussian mixture models are used for speaker identification; one Gaussian mixture model is built for each user and each user's model parameters are trained; the input speech signals, namely the intermediate-state speech and the test speech, first undergo feature extraction; speaker identification is divided into three parts: feature extraction, model training, and identification.
6. The speaker recognition method based on emotion transfer rules and speech modification according to claim 5, characterized in that the emotional speech feature extraction is specifically:
6.1) Pitch (fundamental frequency) extraction:
a) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
b) Set the admissible range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
c) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
d) Compute the subharmonic-to-harmonic ratio (SHR) of each candidate frequency f:
SHR = SS / SH
where SS = Σ_{n=1}^{N} X((n - 1/2) f), SH = Σ_{n=1}^{N} X(nf), and N = f_ceiling / f;
e) Find the frequency f_1 at which the SHR is highest;
f) If f_1 > f_max, or if SS - SH < 0 at f_1, the frame is regarded as non-speech or silence and has no fundamental frequency: Pitch = 0;
g) Search the interval [1.9375 f_1, 2.0625 f_1] for the frequency f_2 at which the SHR reaches a local maximum;
h) If f_2 > f_max, or the SHR at f_2 is below 0.2, then Pitch = f_1;
i) Otherwise, Pitch = f_2;
j) Verify the obtained pitch by autocorrelation: starting from the midpoint of the frame, take a segment one pitch period (1/Pitch) long on each side and compute their autocorrelation value C; if C < 0.2 the pitch value is considered unreliable and Pitch = 0;
k) Finally, apply median smoothing filtering to the whole Pitch sequence;
6.2) Linear predictive coding coefficients:
a) Set the order p of the linear predictive coding analysis;
b) Compute the p-th order LPC coefficients {a_i} (i = 1, 2, ..., p) by the recursion:
R_i = Σ_{n=i}^{N-1} s(n) s(n - i)
E_0 = R_0
k_i = (R_i - Σ_{j=1}^{i-1} a_j^{(i-1)} R_{i-j}) / E_{i-1}
a_i^{(i)} = k_i
a_j^{(i)} = a_j^{(i-1)} + k_i a_{i-j}^{(i-1)},  1 ≤ j ≤ i - 1
E_i = (1 - k_i^2) E_{i-1}
for i = 1, 2, ..., p, and finally
a_j = a_j^{(p)},  1 ≤ j ≤ p.
This yields {a_i}, where R_i is the autocorrelation function, k_i is the partial correlation (PARCOR) coefficient, and E_i is the residual energy of the i-th order optimal linear prediction inverse filter;
6.3) Residual signal:
u(n) = [s(n) - Σ_{i=1}^{p} a_i s(n - i)] / G
where G is the gain factor.
7. The speaker recognition method based on emotion transfer rules and speech modification according to claim 5, characterized in that the specific steps of the speaker identification are:
7.1) MFCC (Mel cepstral coefficient) extraction:
a) Set the order p of the Mel cepstral coefficients;
b) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
c) Compute the Mel-domain scale:
M_i = (i / p) × 2595 log10(1 + (8000 / 2.0) / 700.0), (i = 0, 1, 2, ..., p)
d) Compute the corresponding frequency-domain scale:
f_i = 700 × (10^{M_i / 2595} - 1), (i = 0, 1, 2, ..., p)
e) Compute the log energy spectrum on each Mel-domain channel φ_j:
E_j = Σ_{k=0}^{K/2-1} φ_j(k) |X(k)|²,
where Σ_{k=0}^{K/2-1} φ_j(k) = 1;
f) Apply the discrete cosine transform (DCT);
7.2) GMM model training:
an M-order Gaussian mixture model describes the distribution of the frame features in feature space with a linear combination of M single Gaussian distributions, that is:
p(x) = Σ_{i=1}^{M} P_i b_i(x)
b_i(x) = N(x; u_i, R_i) = 1 / ((2π)^{p/2} |R_i|^{1/2}) exp{-(1/2)(x - u_i)^T R_i^{-1} (x - u_i)}
where p is the dimension of the feature vector, b_i(x) is the kernel function, a Gaussian distribution with mean vector u_i and covariance matrix R_i; M is the order of the GMM, fixed as an integer before the speaker models are built; λ = {P_i, u_i, R_i | i = 1, 2, ..., M} are the parameters of the speaker feature distribution GMM; as the weighting coefficients of the Gaussian mixture, the P_i must satisfy ∫_{-∞}^{+∞} p(x | λ) dx = 1;
7.3) Identification:
after the user's speech is input, feature extraction yields a sequence of feature vectors; this sequence is scored against the GMM built from each enrolled user's model parameters to obtain a similarity value s; the user whose GMM produces the largest value of s is the identified speaker.
8. The speaker recognition method based on emotion transfer rules and speech modification according to claim 5, characterized in that the start and end points of an utterance are located by comparing the speech energy with a predefined energy threshold: an utterance starts when the speech energy exceeds the threshold and remains above it for the following several consecutive frames, and ends when the speech energy remains below the threshold for several consecutive frames.
9. The speaker recognition method based on emotion transfer rules and speech modification according to claim 5, characterized in that the specific steps of the speech modification and intermediate-state speech synthesis are as follows:
9.1) Modify the duration of the neutral speech: the speech duration is changed by increasing or reducing the number of frames of the neutral speech;
round the value of D to obtain K; if D > 1, the emotional variation lengthens the speech, so the number of speech frames must increase: to imitate the emotional speech, in every group of K frames the K-th frame is duplicated as the (K+1)-th frame, and the original (K+1)-th frame is shifted back to become the (K+2)-th frame; if D < 1, the emotional variation shortens the speech, so the number of speech frames must decrease: the last (K-th) frame of every group of K frames is deleted, and the original (K+1)-th frame becomes the K-th frame;
9.2) Modify the fundamental frequency of the neutral speech:
u_m = (u + AP) × PR
where u_m is the modified fundamental frequency of the neutral speech, u is the fundamental frequency of the duration-modified neutral speech, AP is the variation of the mean fundamental frequency, and PR is the variation of the fundamental frequency range;
9.3) Synthesize the intermediate-state speech: using the modified linear predictive coding coefficients and residual information, the intermediate-state speech is obtained by linear predictive synthesis:
s(n) = G u_m(n) + Σ_{i=1}^{p} a_i s(n - i);
9.4) Modify the intensity of the intermediate-state speech: finally, the intensity of the speech obtained by linear predictive synthesis is modified, yielding the intermediate-state speech carrying emotional information:
T_m = (T + AT) × TRC
where T_m is the modified speech intensity, T is the intensity of the speech obtained by linear predictive synthesis, AT is the variation of the mean intensity, and TRC is the variation of the intensity range.
CNB2005100619525A 2005-12-13 2005-12-13 Speaker recognition method based on emotion transfer rules and speech modification Expired - Fee Related CN100543840C (en)

Publications (2)

Publication Number Publication Date
CN1787074A CN1787074A (en) 2006-06-14
CN100543840C true CN100543840C (en) 2009-09-23



Legal Events

C06 / PB01: Publication
C10 / SE01: Entry into force of request for substantive examination
C14 / GR01: Patent grant
C53 / CB03: Correction of inventor or designer information (inventors after correction: Wu Chaohui, Yang Yingchun, Li Dongdong)
CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2009-09-23; termination date: 2017-12-13)