CN100543840C - Speaker recognition method based on emotion migration rules and voice correction - Google Patents

Speaker recognition method based on emotion migration rules and voice correction

Info

Publication number
CN100543840C
CN100543840C CNB2005100619525A CN200510061952A
Authority
CN
China
Prior art keywords
voice
statement
fundamental frequency
emotion
neutral
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100619525A
Other languages
Chinese (zh)
Other versions
CN1787074A (en)
Inventor
Wu Chaohui (吴朝晖)
Yang Yingchun (杨莹春)
Li Dongdong (李东东)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2005100619525A priority Critical patent/CN100543840C/en
Publication of CN1787074A publication Critical patent/CN1787074A/en
Application granted granted Critical
Publication of CN100543840C publication Critical patent/CN100543840C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The present invention relates to a speaker recognition method based on emotion migration rules and voice correction. First, neutral speech and emotional speech sharing the same text are analyzed: the speech features that reflect emotion information are extracted from both, analyzed and compared, and the feature parameters of the collected neutral speech are then modified according to the change patterns of these features. When the emotional state of the test utterance is not neutral, a speech model carrying the corresponding emotion information can be selected for the comparison. The beneficial effect of the invention is that, by combining feature modification with speech synthesis, the enrolled speech and the test speech are brought to a consistent emotional state, which improves the performance of the speaker recognition system.

Description

Speaker recognition method based on emotion migration rules and voice correction
Technical field
The present invention relates to the fields of signal processing and pattern recognition, and in particular to a speaker recognition method based on emotion migration rules and voice correction.
Background technology
With the arrival of the 21st century and the rapid development of biotechnology and information technology, biometric authentication has begun to stand out in the era of global electronic commerce as a more convenient and advanced information security technology. Voiceprint recognition is one such technology: it automatically identifies a speaker from speech parameters in the waveform that reflect the speaker's physiological and behavioral characteristics.
Compared with other biometric technologies, voiceprint recognition, that is, speaker recognition, is contactless, easily accepted, convenient, economical and accurate, and is well suited to remote applications. In practice, however, its performance is affected not only by external noise but also by changes in the speaker's own state (such as emotion), which influence the consistency between the enrolled and the test recordings. A robust voiceprint recognition system should therefore take into account both the physiological and the behavioral characteristics of the speaker. Voiceprint feature extraction should capture not only the physiological characteristics in the speech signal but also its affective characteristics, so that the whole recognition system works on the combination of both and the instability that emotion variation otherwise introduces into system performance is removed.
Existing emotional-speech speaker recognition systems extend the conventional speaker model built from neutral speech by additionally modeling each speaker's speech under various emotional states, in order to eliminate the influence of emotion variation.
Such emotion-based speaker modeling requires the user to provide emotional speech at the same time as the neutral enrollment speech. This deliberate emotional expression is often unacceptable to users and undermines the original friendliness of speaker recognition.
Summary of the invention
The present invention aims to overcome the above defects of the prior art and provides a speaker recognition method based on emotion migration rules and voice correction. By analyzing speech features under different emotional states, the neutral speech is modified so as to enrich its emotion information and to generate intermediate-state speech carrying emotion information, making the emotional states of the enrolled and the test speech consistent and thereby improving speaker recognition performance.
The technical solution adopted by the present invention is as follows: in this speaker recognition method based on emotion migration rules and voice correction, neutral speech and emotional speech with the same text are first analyzed; the speech features that reflect emotion information are extracted from both and compared, and the feature parameters of the collected neutral speech are then modified according to the change patterns of these features. When the emotional state of the test utterance is not neutral, a speech model carrying the corresponding emotion information can be selected for the comparison.
The technical solution can be further refined. The affective features that are compared are the average fundamental frequency, the fundamental frequency range, the utterance duration, the average intensity and the intensity range. The feature parameters of the neutral speech to be modified are, for each frame after framing the audio, the linear predictive coding coefficients and the residual signal obtained by linear predictive coding analysis, together with the speech intensity. The intermediate-state speech carrying emotion information is the speech synthesized by linear predictive coding from the neutral speech feature parameters after modification according to the affective features. The speaker model is obtained by modeling, with a Gaussian mixture model, the Mel cepstral coefficients extracted from the intermediate-state speech carrying emotion information.
The beneficial effect of the present invention is that, by combining feature modification and speech synthesis, the enrolled speech and the test speech are brought to a consistent emotional state, which improves the performance of the speaker recognition system.
Description of drawings
Fig. 1 is the overall framework of the speaker recognition system of the present invention, which counteracts emotion variation on the basis of emotion migration rules and voice correction;
Fig. 2 is the flow chart of the voice correction algorithm of the present invention.
Embodiment
The invention is further described below with reference to the drawings and an embodiment. The method of the present invention comprises five steps.
The first step: audio preprocessing
The audio preprocessing is divided into four parts: sampling and quantization, zero-drift removal, pre-emphasis and windowing.
1. Sampling and quantization
a) Filter the audio signal with a sharp (anti-aliasing) filter so that its Nyquist frequency F_N is 4 kHz;
b) Set the audio sampling rate F = 2 F_N;
c) Sample the analog signal s_a(t) periodically to obtain the amplitude sequence of the digital audio signal s(n) = s_a(n / F);
d) Quantize s(n) with pulse code modulation (PCM) to obtain the quantized amplitude sequence s'(n).
2. Zero-drift removal
a) Compute the mean value of the quantized amplitude sequence;
b) Subtract the mean from every amplitude to obtain the zero-mean amplitude sequence s''(n).
3. Pre-emphasis
a) Set the pre-emphasis coefficient α of the digital filter with transfer function H(z) = 1 - α z^{-1}; α is taken as 1 or a value slightly smaller than 1;
b) Pass s''(n) through this filter to obtain the sequence s'''(n), in which the high, middle and low frequency components of the audio signal have comparable amplitudes.
4. Windowing
a) Compute the frame length N (32 ms) and the frame shift T (10 ms) of the audio frames, which satisfy
N / F = 0.032
T / F = 0.010
where F is the audio sampling rate in Hz;
b) With frame length N and frame shift T, divide s'''(n) into a series of audio frames F_m, each containing N audio samples;
c) Compute the Hamming window function:
ω(n) = 0.54 - 0.46 cos( 2πn / (N - 1) ),  0 ≤ n ≤ N - 1;
d) Apply the Hamming window to each audio frame F_m:
ω(n) × F_m(n) ⇒ { F'_m(n) | n = 0, 1, ..., N - 1 }.
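As an illustration of this preprocessing chain, a minimal Python sketch is given below; it assumes the 8 kHz sampling rate, 32 ms frames and 10 ms shift stated above, and the function and variable names are illustrative rather than part of the patent:

```python
import numpy as np

def preprocess(signal, fs=8000, alpha=0.97, frame_ms=32, shift_ms=10):
    """Zero-drift removal, pre-emphasis and Hamming windowing of a PCM signal."""
    x = signal.astype(np.float64)
    x = x - x.mean()                                   # zero-drift (DC offset) removal
    x = np.append(x[0], x[1:] - alpha * x[:-1])        # pre-emphasis H(z) = 1 - alpha*z^-1
    N = int(fs * frame_ms / 1000)                      # frame length in samples (32 ms)
    T = int(fs * shift_ms / 1000)                      # frame shift in samples (10 ms)
    win = np.hamming(N)                                # 0.54 - 0.46*cos(2*pi*n/(N-1))
    n_frames = 1 + (len(x) - N) // T                   # assumes at least one full frame
    frames = np.stack([x[m * T:m * T + N] * win for m in range(n_frames)])
    return frames                                      # shape: (n_frames, N)
```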
The second step: emotional speech feature extraction
The feature extraction for each speech frame comprises the extraction of the fundamental frequency (pitch), the linear predictive coding coefficients and the residual signal.
1. Pitch extraction:
a) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
b) Set the admissible range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
c) Apply the fast Fourier transform (FFT) to turn the time-domain signal s(n) into the frequency-domain signal X(k);
d) Compute the subharmonic-to-harmonic ratio (SHR) for each candidate frequency f:
SHR = SS / SH
where SS = Σ_{n=1}^{N} X((n - 1/2) f), SH = Σ_{n=1}^{N} X(n f), and N = f_ceiling / f;
e) Find the frequency f_1 with the highest SHR;
f) If f_1 > f_max, or SS - SH < 0 at f_1, the frame is regarded as non-speech or silence; it has no fundamental frequency and Pitch = 0;
g) Search the interval [1.9375 f_1, 2.0625 f_1] for the frequency f_2 at which SHR has a local maximum;
h) If f_2 > f_max, or the SHR at f_2 is less than 0.2, then Pitch = f_1;
i) Otherwise, Pitch = f_2;
j) Verify the obtained pitch by autocorrelation: starting from the midpoint of the frame, take a segment of length 1/Pitch before and after it and compute their autocorrelation value C; if C < 0.2, the pitch value is considered unreliable and Pitch = 0;
k) Finally, apply median smoothing to the whole sequence of pitch values.
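The decision logic of steps d) to i) can be sketched for a single frame as follows; the spectral interpolation, the 1 Hz candidate grid and the helper names are assumptions made for illustration only:

```python
import numpy as np

def shr_pitch(frame, fs=8000, f_min=50.0, f_max=550.0, f_ceiling=1250.0):
    """Single-frame pitch estimate based on the subharmonic-to-harmonic ratio."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def amp(f):                       # spectral amplitude at arbitrary frequencies
        return np.interp(f, freqs, spec)

    def shr(f):                       # SS/SH with N = f_ceiling / f harmonics
        n = np.arange(1, int(f_ceiling / f) + 1)
        ss = amp((n - 0.5) * f).sum()
        sh = amp(n * f).sum()
        return (ss / sh if sh > 0 else 0.0), ss - sh

    cands = np.arange(f_min, f_ceiling, 1.0)
    shr_vals = np.array([shr(f)[0] for f in cands])
    f1 = cands[np.argmax(shr_vals)]
    if f1 > f_max or shr(f1)[1] < 0:              # step f): non-speech or silent frame
        return 0.0
    lo, hi = 1.9375 * f1, 2.0625 * f1             # step g): local maximum near 2*f1
    mask = (cands >= lo) & (cands <= hi)
    if not mask.any():
        return f1
    f2 = cands[mask][np.argmax(shr_vals[mask])]
    if f2 > f_max or shr(f2)[0] < 0.2:            # step h)
        return f1
    return f2                                      # step i)
```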
2. Linear predictive coding coefficients (LPC):
a) Set the order p of the linear predictive coding;
b) Compute the p-th order LPC coefficients {a_i} (i = 1, 2, ..., p) by the recursion:
R_i = Σ_{n=i}^{N-1} s(n) s(n-i)
E_0 = R_0
k_i = ( R_i - Σ_{j=1}^{i-1} a_j^{(i-1)} R_{i-j} ) / E_{i-1}
a_i^{(i)} = k_i
a_j^{(i)} = a_j^{(i-1)} - k_i a_{i-j}^{(i-1)},  1 ≤ j ≤ i - 1
E_i = (1 - k_i^2) E_{i-1}
for i = 1, 2, ..., p, and finally
a_j = a_j^{(p)},  1 ≤ j ≤ p,
which yields {a_i}. Here R_i is the autocorrelation function, k_i is the partial correlation (PARCOR) coefficient, and E_i is the residual energy of the optimal i-th order linear prediction inverse filter.
3. Residual signal:
u(n) = [ s(n) - Σ_{i=1}^{p} a_i s(n-i) ] / G
where G is the gain (drive) factor.
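A compact Python sketch of the Levinson-Durbin recursion and the residual (inverse-filtering) computation above; the default gain G = 1 and the function names are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, p):
    """Levinson-Durbin recursion: returns LPC coefficients a_1..a_p and the residual energy E_p."""
    N = len(frame)
    R = np.array([np.dot(frame[i:], frame[:N - i]) for i in range(p + 1)])  # autocorrelation R_i
    a = np.zeros(p + 1)            # a[0] unused; a[1..p] are the predictor coefficients
    E = R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E   # PARCOR coefficient k_i
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]          # update a_j^(i)
        a = a_new
        E *= (1.0 - k * k)                               # E_i = (1 - k_i^2) E_{i-1}
    return a[1:], E

def residual(frame, a, G=1.0):
    """u(n) = [s(n) - sum_i a_i s(n-i)] / G, i.e. inverse filtering with A(z) = 1 - sum_i a_i z^-i."""
    A = np.concatenate(([1.0], -np.asarray(a)))
    return lfilter(A, [1.0], frame) / G
```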
The third step: emotional feature analysis
The emotional feature analysis covers the average fundamental frequency, the fundamental frequency range, the utterance duration, the average intensity and the intensity range.
1. Average fundamental frequency and its change pattern
a) Average fundamental frequency:
P_mean = ( Σ_{i=1}^{f} P_i ) / f
where P_mean is the average fundamental frequency of a sentence, P_i is the pitch value of frame i, and f is the number of speech frames in the sentence.
b) The change pattern of the average fundamental frequency is the difference between the average fundamental frequencies of the emotional and the neutral speech:
AP = P_mean-e - P_mean-n
where AP is the change pattern of the average fundamental frequency, and P_mean-e and P_mean-n are the average fundamental frequencies of the emotional sentence and of the corresponding neutral sentence, respectively.
2. Fundamental frequency range and its change pattern
a) Fundamental frequency range:
R = P_max - P_min
where R is the fundamental frequency range of a sentence, and P_max and P_min are the maximum and minimum pitch values in the sentence.
b) The change pattern of the fundamental frequency range is the quotient of the range of the emotional speech over that of the neutral speech:
PR = R_e / R_n
where PR is the change pattern of the fundamental frequency range, and R_e and R_n are the fundamental frequency ranges of the emotional sentence and of the corresponding neutral sentence, respectively.
3. Utterance duration and its change pattern
a) The utterance duration of a sentence is its length from start to end. The start and end positions are determined by comparing the speech energy with a predefined energy threshold: a sentence starts when the speech energy exceeds the threshold and stays above it for several consecutive frames, and ends when the speech energy stays below the threshold for several consecutive frames. With this definition, the utterance duration of a sentence is measured by the number of frames obtained in the first step;
b) The change pattern of the utterance duration is the ratio of the duration of the emotional sentence to that of the corresponding neutral sentence:
D = f_e / f_n
where D is the change pattern of the utterance duration, and f_e and f_n are the numbers of speech frames of the emotional sentence and of the corresponding neutral sentence, respectively.
4. Average intensity and its change pattern
a) Average intensity:
T_mean = ( Σ_{i=1}^{K} T_i ) / K
where T_mean is the average intensity of a sentence, T_i is the value of sample i, and K is the number of samples in the sentence.
b) The change pattern of the average intensity is the difference between the average intensities of the emotional and the neutral speech:
AT = T_mean-e - T_mean-n
where AT is the change pattern of the average intensity, and T_mean-e and T_mean-n are the average intensities of the emotional sentence and of the corresponding neutral sentence, respectively.
5. Intensity range and its change pattern
a) Intensity range:
TR = T_max - T_min
where TR is the intensity range of a sentence, and T_max and T_min are the maximum and minimum intensity values in the sentence.
b) The change pattern of the intensity range is the quotient of the range of the emotional speech over that of the neutral speech:
TRC = TR_e / TR_n
where TRC is the change pattern of the intensity range, and TR_e and TR_n are the intensity ranges of the emotional sentence and of the corresponding neutral sentence, respectively.
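Taken together, the five change patterns can be computed from an aligned neutral/emotional sentence pair as in the sketch below; restricting the pitch statistics to voiced frames (Pitch > 0) is an assumption not spelled out in the text, T_i is taken to be the raw sample value exactly as in the formula above, and the function name is illustrative:

```python
import numpy as np

def emotion_migration_rules(pitch_n, samples_n, pitch_e, samples_e):
    """Change patterns between a neutral sentence and the emotional sentence with the same text.

    pitch_n, pitch_e     : per-frame pitch values (0 for unvoiced frames)
    samples_n, samples_e : raw sample values of the two sentences
    """
    voiced_n, voiced_e = pitch_n[pitch_n > 0], pitch_e[pitch_e > 0]
    AP  = voiced_e.mean() - voiced_n.mean()                                       # average-pitch difference
    PR  = (voiced_e.max() - voiced_e.min()) / (voiced_n.max() - voiced_n.min())   # pitch-range ratio
    D   = len(pitch_e) / len(pitch_n)                                             # duration (frame-count) ratio
    AT  = samples_e.mean() - samples_n.mean()                                     # average-intensity difference
    TRC = (samples_e.max() - samples_e.min()) / (samples_n.max() - samples_n.min())  # intensity-range ratio
    return AP, PR, D, AT, TRC
```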
The fourth step: voice correction and intermediate-state speech synthesis
Once the change patterns between the neutral and the emotional speech have been obtained, the parameters of the neutral speech can be modified according to this change information, and the modified parameters are then used to synthesize intermediate-state speech carrying emotion information.
1. Modify the duration of the neutral speech
The speech duration is changed by adding frames to, or removing frames from, the neutral speech.
Round the value of D to an integer K. If D is greater than 1, the emotion change lengthens the speech, so the number of speech frames must be increased: in order to imitate the emotional speech, at the end of every group of K frames the K-th frame is duplicated to serve as the (K+1)-th frame, and the original (K+1)-th frame is shifted back to become the (K+2)-th frame. If D is less than 1, the emotion change shortens the speech, so the number of speech frames must be reduced: the last frame (the K-th) of every group of K frames is deleted, and the original (K+1)-th frame becomes the K-th frame.
2. Modify the fundamental frequency of the neutral speech
u_m = (u + AP) × PR
where u_m is the fundamental frequency of the modified neutral speech, u is the fundamental frequency of the neutral speech after the duration modification, AP is the change pattern of the average fundamental frequency, and PR is the change pattern of the fundamental frequency range.
3. Synthesize the intermediate-state speech
Using the modified linear predictive coding coefficients and residual signal, the intermediate-state speech is synthesized by linear predictive coding:
s(n) = G u_m(n) + Σ_{i=1}^{p} a_i s(n-i)
4. Modify the intensity of the intermediate-state speech
Finally, the intensity of the speech obtained from the linear predictive coding synthesis is modified, yielding the intermediate-state speech carrying emotion information:
T_m = (T + AT) × TRC
where T_m is the intensity of the modified speech, T is the intensity of the speech obtained from the synthesis, AT is the change pattern of the average intensity, and TRC is the change pattern of the intensity range.
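A schematic Python sketch of the frame-level duration modification, LPC synthesis and intensity modification of this step; it assumes per-frame LPC coefficients and excitation (residual) signals from the second step, applies the grouping rule with K = round(D) exactly as stated above, and all helper names are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def modify_duration(frames, D):
    """Duplicate or delete one frame in every group of K = round(D) frames."""
    K = max(1, int(round(D)))
    out = []
    for start in range(0, len(frames), K):
        group = list(frames[start:start + K])
        if D > 1 and len(group) == K:
            group.append(group[-1])        # repeat the K-th frame of the group
        elif D < 1 and len(group) == K:
            group.pop()                    # delete the K-th frame of the group
        out.extend(group)
    return out

def synthesize(lpc_coeffs, excitations, G=1.0):
    """Per-frame LPC synthesis: s(n) = G*u(n) + sum_i a_i s(n-i)."""
    out = []
    for a, u in zip(lpc_coeffs, excitations):
        A = np.concatenate(([1.0], -np.asarray(a)))   # A(z) = 1 - sum_i a_i z^-i
        out.append(lfilter([G], A, u))                # all-pole synthesis filter G / A(z)
    return np.concatenate(out)

def modify_intensity(speech, T, AT, TRC):
    """T_m = (T + AT) * TRC: rescale the synthesized speech toward the target intensity."""
    T_m = (T + AT) * TRC
    return speech * (T_m / T) if T != 0 else speech
```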
The fifth step: speaker recognition
After the intermediate-state speech carrying emotion information has been obtained, its Mel cepstral features are extracted and a Gaussian mixture model (GMM) is used for speaker recognition. A Gaussian mixture model is built for every user, and each user's model parameters must be trained. The input speech signals (the intermediate-state speech and the test speech) first undergo feature extraction. Speaker recognition is divided into three parts: feature extraction, model training and identification.
1. MFCC extraction:
a) Set the order p of the Mel cepstral coefficients;
b) Apply the fast Fourier transform (FFT) to turn the time-domain signal s(n) into the frequency-domain signal X(k);
c) Compute the Mel-domain scale:
M_i = (i / p) × 2595 log10( 1 + (8000 / 2.0) / 700.0 ),  (i = 0, 1, 2, ..., p)
d) Compute the corresponding frequency-domain scale:
f_i = 700 × ( 10^{M_i / 2595} - 1 ),  (i = 0, 1, 2, ..., p)
e) Compute the logarithmic energy in each Mel-domain channel φ_j:
E_j = Σ_{k=0}^{K/2-1} φ_j(k) |X(k)|^2
where Σ_{k=0}^{K/2-1} φ_j(k) = 1;
f) Apply the discrete cosine transform (DCT) to obtain the Mel cepstral coefficients.
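A minimal sketch of this MFCC front end for one windowed frame; the triangular filter shape, the use of p + 2 edge points for p channels, the log compression before the DCT and the 256-point FFT are standard choices assumed here rather than prescribed by the text:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, fs=8000, p=12, n_fft=256):
    """Mel cepstral coefficients of one windowed frame."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # |X(k)|^2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)    # Mel value of the Nyquist frequency
    mel_pts = np.linspace(0.0, mel_max, p + 2)               # p channels need p + 2 edge points
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)      # back to the frequency domain
    E = np.zeros(p)
    for j in range(p):
        lo, ctr, hi = hz_pts[j], hz_pts[j + 1], hz_pts[j + 2]
        w = np.clip(np.minimum((freqs - lo) / (ctr - lo), (hi - freqs) / (hi - ctr)), 0.0, None)
        if w.sum() > 0:
            w = w / w.sum()                                  # sum_k phi_j(k) = 1
        E[j] = np.log(np.dot(w, spec) + 1e-10)               # log channel energy
    return dct(E, type=2, norm='ortho')[:p]                  # DCT -> MFCC
```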
2. GMM model training
Each speaker's speech features form a specific distribution in the feature space, and this distribution can be used to describe the speaker's individuality. A Gaussian mixture model (GMM) approximates the speaker's feature distribution by a linear combination of several Gaussian distributions.
Every speaker's probability density function has the same functional form; only the parameters differ. An M-th order Gaussian mixture model describes the distribution of the frame features in the feature space by a linear combination of M single Gaussian distributions, that is:
p(x) = Σ_{i=1}^{M} P_i b_i(x)
b_i(x) = N(x; u_i, R_i)
       = 1 / ( (2π)^{p/2} |R_i|^{1/2} ) exp{ -1/2 (x - u_i)^T R_i^{-1} (x - u_i) }
where p is the feature dimension and b_i(x) is the kernel function, a Gaussian distribution with mean vector u_i and covariance matrix R_i. M (typically 16 or 32) is the order of the GMM model, fixed to a definite integer before the speaker models are built. λ = { P_i, u_i, R_i | i = 1, 2, ..., M } is the parameter set of the speaker feature distribution GMM. As the weighting coefficients of the Gaussian mixture, the P_i must satisfy
∫_{-∞}^{+∞} p(x|λ) dx = 1
Because computing p(x) in the GMM requires inverting the p × p matrices R_i (i = 1, 2, ..., M), the computational load is heavy. For this reason, the R_i are constrained to be diagonal, so that the inversions reduce to element-wise reciprocals and the computation is faster.
3. Identification
After the user's speech is input, feature extraction yields a sequence of feature vectors. This sequence is fed into the GMMs of the enrolled user models to obtain a similarity score s. The user whose GMM model produces the largest s is taken as the identified speaker.
Experimental results
The system was evaluated on the Emotional Prosody Speech corpus. This corpus is an emotional speech database built by the Linguistic Data Consortium according to its database standards for studying the pronunciation characteristics of speech under different emotions. It was recorded by 7 professional actors (3 male and 4 female target speakers), who read aloud a set of given English sentences, mainly dates and numbers, covering 14 different emotion categories. The actors were asked to express the corresponding emotions with different tones, intonations and speaking rates. The recording length per emotion varies across speakers, mostly between 10 and 40 seconds, with only a few reaching about 50 seconds; the total recording length per speaker is roughly 5 to 6 minutes.
On the same corpus we also ran the traditional speaker recognition method (Baseline) and a speaker recognition method that adds linear predictive coding analysis and synthesis but performs no feature modification (Unmodified LPC), for comparison with the proposed system (Modified LPC). Both reference methods model the speakers using neutral speech only and use no prior knowledge of any emotion.
The traditional speaker recognition method without any additional processing corresponds to the first and fifth steps of this description: the neutral speech is preprocessed, its Mel cepstral features are extracted, and the speakers are modeled with Gaussian mixture models. Likewise, the emotional test speech is preprocessed and its Mel cepstral features are extracted before being scored against the trained speaker models; the speaker corresponding to the highest-scoring model is taken as the recognized speaker.
The method that adds linear predictive coding analysis and synthesis but performs no feature modification extends the traditional method as follows: after the speech preprocessing, linear predictive coding analysis is carried out, the original linear predictive coding coefficients and residual signal obtained from the analysis are used for synthesis without any modification, the Mel cepstral features are then extracted from the synthesized speech, and the speakers are modeled with Gaussian mixture models. Likewise, the emotional test speech goes through the preprocessing plus the additional linear predictive coding analysis and synthesis.
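For concreteness, the Unmodified LPC pipeline can be pictured as a per-frame round trip through the illustrative helpers sketched in the preceding steps (lpc, residual, synthesize and mfcc), with no feature modified between analysis and synthesis:

```python
import numpy as np

def unmodified_lpc_features(frames, fs=8000, p=12):
    """LPC analysis -> resynthesis without modification -> MFCC, frame by frame."""
    feats = []
    for frame in frames:                        # frames from preprocess() in the first step
        a, _ = lpc(frame, p)                    # LPC analysis (second step)
        u = residual(frame, a, G=1.0)           # unmodified residual signal
        resyn = synthesize([a], [u], G=1.0)     # resynthesis with no correction applied
        feats.append(mfcc(resyn, fs, p))        # Mel cepstral front end (fifth step)
    return np.stack(feats)                      # feature matrix for GMM modeling
```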
We evaluated the recognition of the 14 kinds of emotional test speech against the speaker models trained on neutral speech.
The experimental results are as follows:
[Table: identification rates (IR) of the Baseline, Unmodified LPC and Modified LPC methods on the 14 emotion categories.]
In the table, "Baseline" denotes the traditional speaker recognition method, "Unmodified LPC" denotes the method that adds linear predictive coding analysis and synthesis without feature modification, and "Modified LPC" denotes the method proposed by this system. "IR" denotes the speaker identification rate: under the assumption that the user is legitimate, the applicant is compared against all enrolled users in the database and the most similar user is returned; the identification is correct if the applicant and the returned user are the same person.
The experimental results show that the traditional speaker recognition method without any additional processing achieves a good identification rate when the emotional states of the enrolled and the test speech agree (i.e., when both are neutral speech), but its performance drops sharply when the emotion of the test speech changes.
Performing linear predictive coding analysis and synthesis before extracting the Mel cepstral features does not lose the speaker-discriminative characteristics.
By analyzing emotional speech and neutral speech with the same text, the proposed recognition algorithm modifies the neutral speech according to the speech change patterns and adds emotion information to it. The algorithm needs no emotional speech from the specific speaker, relying only on prior knowledge of the emotions, and it improves the speaker identification rate, thereby strengthening the robustness of the speaker recognition system under fluctuations of the speaker's emotion.

Claims (10)

1. A speaker recognition method based on emotion migration rules and voice correction, characterized in that: neutral speech and emotional speech with the same text are first analyzed; the speech features that reflect emotion information are extracted from both and compared; the feature parameters of the collected neutral speech are then modified according to the change patterns of these features to obtain intermediate-state speech carrying emotion information; and when the emotional state of the test utterance is not neutral, a speaker model built from the intermediate-state speech carrying the corresponding emotion information is used for the comparison.
2. The speaker recognition method based on emotion migration rules and voice correction according to claim 1, characterized in that: the feature parameters of the neutral speech to be modified are the linear predictive coding coefficients and residual signal obtained by linear predictive coding analysis of the framed audio, together with the speech intensity.
3. The speaker recognition method based on emotion migration rules and voice correction according to claim 1, characterized in that: the intermediate-state speech carrying emotion information is the speech synthesized by linear predictive coding from the neutral speech feature parameters after modification according to the affective features.
4. The speaker recognition method based on emotion migration rules and voice correction according to claim 1, characterized in that: the speaker model is obtained by modeling, with a Gaussian mixture model, the Mel cepstral coefficients extracted from the intermediate-state speech carrying emotion information.
5. The speaker recognition method based on emotion migration rules and voice correction according to claim 1, 2, 3 or 4, characterized in that the main steps of the method are:
5.1) Audio preprocessing: the audio preprocessing is divided into four parts: sampling and quantization, zero-drift removal, pre-emphasis and windowing;
5.2) Emotional speech feature extraction: the feature extraction for each speech frame comprises the extraction of the fundamental frequency, the linear predictive coding coefficients and the residual signal;
5.3) Emotional feature analysis: covering the average fundamental frequency, the fundamental frequency range, the utterance duration, the average intensity and the intensity range;
5.3.1) Average fundamental frequency and its change pattern
a) Average fundamental frequency:
P_mean = ( Σ_{i=1}^{f} P_i ) / f
where P_mean is the average fundamental frequency of a sentence, P_i is the pitch value of frame i, and f is the number of speech frames in the sentence;
b) The change pattern of the average fundamental frequency is the difference between the average fundamental frequencies of the emotional and the neutral speech:
AP = P_mean-e - P_mean-n
where AP is the change pattern of the average fundamental frequency, and P_mean-e and P_mean-n are the average fundamental frequencies of the emotional sentence and of the corresponding neutral sentence, respectively;
5.3.2) Fundamental frequency range and its change pattern
a) Fundamental frequency range:
R = P_max - P_min
where R is the fundamental frequency range of a sentence, and P_max and P_min are the maximum and minimum pitch values in the sentence;
b) The change pattern of the fundamental frequency range is the quotient of the range of the emotional speech over that of the neutral speech:
PR = R_e / R_n
where PR is the change pattern of the fundamental frequency range, and R_e and R_n are the fundamental frequency ranges of the emotional sentence and of the corresponding neutral sentence, respectively;
5.3.3) Utterance duration and its change pattern
a) The utterance duration of a sentence is its length from start to end; the number of frames of each sentence computed in step 5.1) determines the utterance duration of the sentence;
b) The change pattern of the utterance duration is the ratio of the duration of the emotional sentence to that of the corresponding neutral sentence: D = f_e / f_n
where D is the change pattern of the utterance duration, and f_e and f_n are the numbers of speech frames of the emotional sentence and of the corresponding neutral sentence, respectively;
5.3.4) Average intensity and its change pattern
a) Average intensity:
T_mean = ( Σ_{i=1}^{K} T_i ) / K
where T_mean is the average intensity of a sentence, T_i is the value of sample i, and K is the number of samples in the sentence;
b) The change pattern of the average intensity is the difference between the average intensities of the emotional and the neutral speech:
AT = T_mean-e - T_mean-n
where AT is the change pattern of the average intensity, and T_mean-e and T_mean-n are the average intensities of the emotional sentence and of the corresponding neutral sentence, respectively;
5.3.5) Intensity range and its change pattern
a) Intensity range:
TR = T_max - T_min
where TR is the intensity range of a sentence, and T_max and T_min are the maximum and minimum intensity values in the sentence;
b) The change pattern of the intensity range is the quotient of the range of the emotional speech over that of the neutral speech:
TRC = TR_e / TR_n
where TRC is the change pattern of the intensity range, and TR_e and TR_n are the intensity ranges of the emotional sentence and of the corresponding neutral sentence, respectively;
5.4) Voice correction and intermediate-state speech synthesis:
after the change patterns between the neutral and the emotional speech have been obtained, the parameters of the neutral speech are modified according to this change information, and the modified parameters are used to synthesize the intermediate-state speech carrying emotion information;
5.5) Speaker recognition:
after the intermediate-state speech carrying emotion information has been obtained, its Mel cepstral features are extracted and a Gaussian mixture model is used for speaker recognition; a Gaussian mixture model is built for every user and each user's model parameters are trained; the input speech signals, namely the intermediate-state speech and the test speech, first undergo feature extraction; speaker recognition is divided into three parts: feature extraction, model training and identification.
6. The speaker recognition method based on emotion migration rules and voice correction according to claim 5, characterized in that the emotional speech feature extraction is specifically:
6.1) Pitch (fundamental frequency) extraction:
a) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
b) Set the admissible range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
c) Apply the fast Fourier transform (FFT) to turn the time-domain signal s(n) into the frequency-domain signal X(k);
d) Compute the subharmonic-to-harmonic ratio for each candidate frequency f:
SHR = SS / SH
where SS = Σ_{n=1}^{N} X((n - 1/2) f), SH = Σ_{n=1}^{N} X(n f), and N = f_ceiling / f;
e) Find the frequency f_1 with the highest SHR;
f) If f_1 > f_max, or SS - SH < 0 at f_1, the frame is regarded as non-speech or silence; it has no fundamental frequency and Pitch = 0;
g) Search the interval [1.9375 f_1, 2.0625 f_1] for the frequency f_2 at which SHR has a local maximum;
h) If f_2 > f_max, or the SHR at f_2 is less than 0.2, then Pitch = f_1;
i) Otherwise, Pitch = f_2;
j) Verify the obtained pitch by autocorrelation: starting from the midpoint of the frame, take a segment of length 1/Pitch before and after it and compute their autocorrelation value C; if C < 0.2, the pitch value is considered unreliable and Pitch = 0;
k) Finally, apply median smoothing to the whole sequence of pitch values;
6.2) Linear predictive coding coefficients:
a) Set the order p of the linear predictive coding;
b) Compute the p-th order LPC coefficients {a_i} (i = 1, 2, ..., p) by the recursion:
R_i = Σ_{n=i}^{N-1} s(n) s(n-i)
E_0 = R_0
k_i = ( R_i - Σ_{j=1}^{i-1} a_j^{(i-1)} R_{i-j} ) / E_{i-1}
a_i^{(i)} = k_i
a_j^{(i)} = a_j^{(i-1)} - k_i a_{i-j}^{(i-1)},  1 ≤ j ≤ i - 1
E_i = (1 - k_i^2) E_{i-1}
for i = 1, 2, ..., p, and finally
a_j = a_j^{(p)},  1 ≤ j ≤ p,
which yields {a_i}; here R_i is the autocorrelation function, k_i is the partial correlation (PARCOR) coefficient, and E_i is the residual energy of the optimal i-th order linear prediction inverse filter;
6.3) Residual signal:
u(n) = [ s(n) - Σ_{i=1}^{p} a_i s(n-i) ] / G
where G is the gain (drive) factor.
7. The speaker recognition method based on emotion migration rules and voice correction according to claim 5, characterized in that the speaker recognition comprises the following concrete steps:
7.1) MFCC (Mel cepstral coefficient) extraction:
a) Set the order p of the Mel cepstral coefficients;
b) Apply the fast Fourier transform (FFT) to turn the time-domain signal s(n) into the frequency-domain signal X(k);
c) Compute the Mel-domain scale:
M_i = (i / p) × 2595 log10( 1 + (8000 / 2.0) / 700.0 ),  (i = 0, 1, 2, ..., p)
d) Compute the corresponding frequency-domain scale:
f_i = 700 × ( 10^{M_i / 2595} - 1 ),  (i = 0, 1, 2, ..., p)
e) Compute the logarithmic energy in each Mel-domain channel φ_j:
E_j = Σ_{k=0}^{K/2-1} φ_j(k) |X(k)|^2,
where Σ_{k=0}^{K/2-1} φ_j(k) = 1;
f) Apply the discrete cosine transform (DCT);
7.2) GMM model training:
an M-th order Gaussian mixture model describes the distribution of the frame features in the feature space by a linear combination of M single Gaussian distributions, that is:
p(x) = Σ_{i=1}^{M} P_i b_i(x)
b_i(x) = N(x; u_i, R_i)
       = 1 / ( (2π)^{p/2} |R_i|^{1/2} ) exp{ -1/2 (x - u_i)^T R_i^{-1} (x - u_i) }
where p is the feature dimension and b_i(x) is the kernel function, a Gaussian distribution with mean vector u_i and covariance matrix R_i; M is the order of the GMM model, fixed to a definite integer before the speaker models are built; λ = { P_i, u_i, R_i | i = 1, 2, ..., M } is the parameter set of the speaker feature distribution GMM; as the weighting coefficients of the Gaussian mixture, the P_i must satisfy ∫_{-∞}^{+∞} p(x|λ) dx = 1;
7.3) Identification:
after the user's speech is input, feature extraction yields a sequence of feature vectors; this sequence is fed into the GMMs of the enrolled user models to obtain a similarity score s; the user whose GMM model produces the largest s is taken as the identified speaker.
8. The speaker recognition method based on emotion migration rules and voice correction according to claim 5, characterized in that the start and end positions of a sentence are determined by comparing the speech energy with a predefined energy threshold: when the speech energy exceeds the threshold and stays above it for several consecutive frames, the sentence starts; when the speech energy stays below the threshold for several consecutive frames, the sentence ends.
9. The speaker recognition method based on emotion migration rules and voice correction according to claim 5, characterized in that the concrete steps of the voice correction and intermediate-state speech synthesis are as follows:
9.1) Modify the duration of the neutral speech: the speech duration is changed by increasing or reducing the number of frames of the neutral speech;
round the value of D to an integer K; if D is greater than 1, the emotion change lengthens the speech, so the number of speech frames must be increased: in order to imitate the emotional speech, at the end of every group of K frames the K-th frame is duplicated to serve as the (K+1)-th frame, and the original (K+1)-th frame is shifted back to become the (K+2)-th frame; if D is less than 1, the emotion change shortens the speech, so the number of speech frames must be reduced: the last frame (the K-th) of every group of K frames is deleted, and the original (K+1)-th frame becomes the K-th frame;
9.2) Modify the fundamental frequency of the neutral speech:
u_m = (u + AP) × PR
where u_m is the fundamental frequency of the modified neutral speech, u is the fundamental frequency of the neutral speech after the duration modification, AP is the change pattern of the average fundamental frequency, and PR is the change pattern of the fundamental frequency range;
9.3) Synthesize the intermediate-state speech:
using the modified linear predictive coding coefficients and residual signal, the intermediate-state speech is synthesized by linear predictive coding:
s(n) = G u_m(n) + Σ_{i=1}^{p} a_i s(n-i)
9.4) Modify the intensity of the intermediate-state speech:
finally, the intensity of the speech obtained from the linear predictive coding synthesis is modified, yielding the intermediate-state speech carrying emotion information:
T_m = (T + AT) × TRC
where T_m is the intensity of the modified speech, T is the intensity of the speech obtained from the synthesis, AT is the change pattern of the average intensity, and TRC is the change pattern of the intensity range.
CNB2005100619525A 2005-12-13 2005-12-13 Speaker recognition method based on emotion migration rules and voice correction Expired - Fee Related CN100543840C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100619525A CN100543840C (en) 2005-12-13 2005-12-13 Speaker recognition method based on emotion migration rules and voice correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100619525A CN100543840C (en) 2005-12-13 2005-12-13 Speaker recognition method based on emotion migration rules and voice correction

Publications (2)

Publication Number Publication Date
CN1787074A CN1787074A (en) 2006-06-14
CN100543840C true CN100543840C (en) 2009-09-23

Family

ID=36784492

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100619525A Expired - Fee Related CN100543840C (en) 2005-12-13 2005-12-13 Speaker recognition method based on emotion migration rules and voice correction

Country Status (1)

Country Link
CN (1) CN100543840C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226742B (en) * 2007-12-05 2011-01-26 浙江大学 Method for recognizing sound-groove based on affection compensation
CN102655002B (en) * 2011-03-01 2013-11-27 株式会社理光 Audio processing method and audio processing equipment
CN102332263B (en) * 2011-09-23 2012-11-07 浙江大学 Close neighbor principle based speaker recognition method for synthesizing emotional model
CN105374357B (en) * 2015-11-23 2022-03-29 青岛海尔智能技术研发有限公司 Voice recognition method and device and voice control system
CN105554281A (en) * 2015-12-21 2016-05-04 联想(北京)有限公司 Information processing method and electronic device
CN107516511B (en) 2016-06-13 2021-05-25 微软技术许可有限责任公司 Text-to-speech learning system for intent recognition and emotion
CN107221344A (en) * 2017-04-07 2017-09-29 南京邮电大学 A kind of speech emotional moving method
CN109102810B (en) * 2017-06-21 2021-10-15 北京搜狗科技发展有限公司 Voiceprint recognition method and device
CN111274807B (en) * 2020-02-03 2022-05-10 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN1787074A (en) 2006-06-14

Similar Documents

Publication Publication Date Title
CN100543840C (en) Speaker recognition method based on emotion migration rules and voice correction
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
US11322155B2 (en) Method and apparatus for establishing voiceprint model, computer device, and storage medium
Kinnunen Spectral features for automatic text-independent speaker recognition
CN101178897B (en) Speaking man recognizing method using base frequency envelope to eliminate emotion voice
CN100440315C (en) Speaker recognition method based on MFCC linear emotion compensation
Muda et al. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques
CN104700843A (en) Method and device for identifying ages
CN101226743A (en) Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN101923855A (en) Test-irrelevant voice print identifying system
CN102655003B (en) Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
Nidhyananthan et al. Language and text-independent speaker identification system using GMM
CN101419800B (en) Emotional speaker recognition method based on frequency spectrum translation
Chauhan et al. Speech to text converter using Gaussian Mixture Model (GMM)
Kim Singing voice analysis/synthesis
Baghel et al. Exploration of excitation source information for shouted and normal speech classification
Kumar et al. Text dependent speaker identification in noisy environment
Razak et al. Towards automatic recognition of emotion in speech
Shi et al. Study about Chinese speech synthesis algorithm and acoustic model based on wireless communication network
Zheng et al. The Extraction Method of Emotional Feature Based on Children's Spoken Speech
Shi et al. Research Article Study about Chinese Speech Synthesis Algorithm and Acoustic Model Based on Wireless Communication Network
Liu et al. Study about Chinese Speech Synthesis Algorithm and Acoustic Model Based on Wireless Communication Network
Hautamäki Fundamental Frequency Estimation and Modeling for Speaker Recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Wu Chaohui

Inventor after: Yang Yingchun

Inventor after: Li Dongdong

Inventor before: Wu Chaohui

Inventor before: Yang Yingchun

Inventor before: Li Dongdong

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090923

Termination date: 20171213