CN102238190A - Identity authentication method and system - Google Patents


Publication number
CN102238190A
Authority
CN
China
Prior art keywords
voiceprint
model
feature sequence
likelihood
voiceprint feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102180452A
Other languages
Chinese (zh)
Other versions
CN102238190B (en)
Inventor
潘逸倩
胡国平
何婷婷
魏思
胡郁
王智国
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN2011102180452A priority Critical patent/CN102238190B/en
Publication of CN102238190A publication Critical patent/CN102238190A/en
Application granted granted Critical
Publication of CN102238190B publication Critical patent/CN102238190B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses an identity authentication method and system. The method comprises: when a user logs in, receiving a continuous speech signal recorded by the current login user; extracting a voiceprint feature sequence from the continuous speech signal; computing the likelihood of the voiceprint feature sequence with respect to a background model; computing the likelihood of the voiceprint feature sequence with respect to the speaker model of the current login user, wherein the speaker model is a multiple-Gaussian-mixture (multi-GMM) model constructed according to the repetition count and frame lengths of the registration speech signals recorded when the current login user registered; computing a likelihood ratio from the likelihood of the voiceprint feature sequence with respect to the speaker model and its likelihood with respect to the background model; and if the likelihood ratio is greater than a preset threshold, determining that the current login user is a validly authenticated user, otherwise determining that the current login user is a non-authenticated user. The method and system improve the accuracy of identity authentication based on a voiceprint password.

Description

Identity authentication method and system
Technical field
The present invention relates to the field of identity authentication, and in particular to an identity authentication method and system.
Background technology
Voiceprint recognition (VPR), also called speaker recognition, falls into two classes: speaker identification and speaker verification. The former judges which of several persons uttered a given segment of speech — a "choose one of many" problem; the latter confirms whether a segment of speech was uttered by a specified person — a "one-to-one" discrimination problem. Different tasks and applications call for different voiceprint recognition techniques.
Voiceprint verification confirms a speaker's identity from a collected speech signal and thus belongs to the "one-to-one" discrimination problem. Mainstream voiceprint verification systems adopt a hypothesis-testing framework: the likelihoods of the voiceprint signal with respect to a speaker model and a background model are computed separately, and the decision is made by comparing their likelihood ratio against a threshold set empirically in advance. The accuracy of the background model and the speaker model therefore directly affects verification performance; for data-driven statistical models, more training data generally yields a better model.
Voiceprint password authentication is a text-dependent speaker authentication method. The user is required to speak a fixed password text, and the speaker's identity is confirmed accordingly. Because both registration and authentication use speech input of the same fixed password text, the voiceprints tend to be consistent, so better authentication performance can be obtained than with text-independent speaker verification.
The mainstream technical route of current voiceprint password authentication systems is the GMM-UBM algorithm, in which Gaussian mixture models (GMM) are used to model both the background model (Universal Background Model, UBM) and the speaker model. The UBM describes the characteristics that speakers' voiceprints have in common. Since each speaker's voiceprint also has its own specificity, a UBM trained on data from many speakers needs a complex model structure to fit the distribution of such heterogeneous data; current UBMs typically use GMMs with 1024 or even more Gaussians.
The speaker model is trained online by the system from the registration speech when the user registers. Because registration speech samples are usually limited, directly training a complex model from them easily yields an insufficiently accurate model owing to data sparsity. In the prior art, therefore, the background model is usually taken as the initial model, and part of its parameters are adjusted from the small amount of speaker data by an adaptive method — for example the widely used maximum a posteriori (MAP) adaptation algorithm — thereby adapting the shared voiceprint characteristics of the background model toward the individuality of the current speaker.
Such adaptive-update algorithms establish a one-to-one correspondence between each Gaussian of the speaker's mixture model and the Gaussians of the universal background model. The speaker model therefore has too many parameters, which easily causes the following problems in a voiceprint password authentication system with little registration data:
1. Model redundancy: in a voiceprint password authentication system, the speaker model is trained from the few sample recordings obtained by repeating the registration password. With so few samples, the adaptation algorithm can only update part of the Gaussians of the initial background model, and many components remain nearly identical to the background model. These redundant model parameters increase storage and computation pressure, and in turn reduce decoding efficiency.
2. Heavy training load: the adaptation algorithm must compute the sample statistics of each of the 1024 or more Gaussians of the initial background model and update their parameters.
3. Because re-estimating the speaker model's variance is difficult in the adaptation algorithm, the background model's variance is usually adopted directly. But the background model simulates the shared characteristics of many speakers' voiceprints, so its probability-distribution variance tends to be large, whereas the variance of a speaker model should capture the specific characteristics of that speaker's voiceprint. Directly reusing the background model variance fails to reflect the speaker model's characteristics, reduces the discrimination between different speaker models, and thus degrades recognition accuracy.
Summary of the invention
Embodiments of the present invention provide an identity authentication method and system, to improve the accuracy of identity authentication based on a voiceprint password.
In one aspect, an embodiment of the invention provides an identity authentication method, comprising:
when a user logs in, receiving a continuous speech signal recorded by the current login user;
extracting a voiceprint feature sequence from the continuous speech signal, the voiceprint feature sequence comprising a group of voiceprint features;
computing the likelihood of the voiceprint feature sequence with respect to a background model;
computing the likelihood of the voiceprint feature sequence with respect to the speaker model of the current login user, the speaker model being a multi-GMM model constructed according to the repetition count and frame lengths of the registration speech signals recorded when the current login user registered;
computing a likelihood ratio from the likelihood of the voiceprint feature sequence with respect to the speaker model and its likelihood with respect to the background model; and
if the likelihood ratio is greater than a preset threshold, determining that the current login user is a validly authenticated user; otherwise, determining that the current login user is a non-authenticated user.
In another aspect, an embodiment of the invention provides an identity authentication system, comprising:
a speech signal receiving unit, configured to receive, when a user logs in, a continuous speech signal recorded by the current login user;
an extraction unit, configured to extract a voiceprint feature sequence from the continuous speech signal, the voiceprint feature sequence comprising a group of voiceprint features;
a first computing unit, configured to compute the likelihood of the voiceprint feature sequence with respect to a background model;
a second computing unit, configured to compute the likelihood of the voiceprint feature sequence with respect to the speaker model of the current login user, the speaker model being a multi-GMM model constructed according to the repetition count and frame lengths of the registration speech signals recorded when the current login user registered;
a third computing unit, configured to compute a likelihood ratio from the likelihood of the voiceprint feature sequence with respect to the speaker model and its likelihood with respect to the background model; and
a judging unit, configured to determine, when the likelihood ratio computed by the third computing unit is greater than a preset threshold, that the current login user is a validly authenticated user, and otherwise that the current login user is a non-authenticated user.
With the identity authentication method and system provided by embodiments of the invention, the likelihoods of the voiceprint feature sequence extracted from the current login user's continuous speech signal are computed with respect to both the speaker model of the current login user and the background model; a likelihood ratio is then computed, and whether the current login user is a validly authenticated user is decided from that ratio. Because the speaker model used here is a multi-GMM model constructed from the speech signals recorded when the current login user registered, it can simulate the pronunciation variations that exist when the user repeats the same speech signal (i.e., the password), improving the accuracy of identity authentication based on a voiceprint password.
Description of drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the identity authentication method according to an embodiment of the invention;
Fig. 2 is a flowchart of a background model parameter training process in an embodiment of the invention;
Fig. 3 is a flowchart of building a speaker model with a traditional adaptation algorithm;
Fig. 4 is a flowchart of building a speaker model in an embodiment of the invention;
Fig. 5 is a schematic structural diagram of the identity authentication system according to an embodiment of the invention;
Fig. 6 is another schematic structural diagram of the identity authentication system according to an embodiment of the invention.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Fig. 1 is a flowchart of the identity authentication method according to an embodiment of the invention, comprising the following steps:
Step 101: when a user logs in, receive a continuous speech signal recorded by the current login user.
Step 102: extract a voiceprint feature sequence from the continuous speech signal.
The voiceprint feature sequence comprises a group of voiceprint features that can effectively distinguish different speakers while remaining relatively stable across variations of the same speaker.
Typical voiceprint features include spectral-envelope parameters, pitch contour, formant frequency and bandwidth, linear prediction coefficients, cepstral coefficients, and so on. Considering quantifiability, the amount of training samples, and system-performance evaluation, MFCC (Mel Frequency Cepstral Coefficient) features may be selected: each frame of speech, with a 25 ms window shifted by 10 ms, is short-time analyzed to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total. Each speech signal can thus be quantized into a sequence X of 39-dimensional voiceprint feature vectors.
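As a minimal sketch of the short-time analysis described above (the full MFCC and delta computation is omitted), the following shows how a 16 kHz signal can be split into 25 ms frames with a 10 ms shift; the sample rate, Hamming window, and function name are assumptions for illustration, not specified by the patent.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, shift_ms=10):
    """Split a 1-D speech signal into overlapping short-time analysis frames
    (25 ms window, 10 ms shift, as in the text), with a Hamming window applied."""
    win = sample_rate * win_ms // 1000        # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // shift
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)      # shape (n_frames, win)
```

Each windowed frame would then be passed to an MFCC analysis producing 13 coefficients plus first- and second-order differences, giving the 39-dimensional vectors mentioned in the text.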
Step 103: compute the likelihood of the voiceprint feature sequence with respect to the background model.
The likelihood of a voiceprint feature vector sequence X of T frames with respect to the background model (UBM) is

$$p(X\mid \mathrm{UBM}) = \frac{1}{T}\sum_{t=1}^{T}\sum_{m=1}^{M} c_m\, N(X_t;\mu_m,\Sigma_m) \qquad (1)$$

where $c_m$ is the weight coefficient of the m-th Gaussian, satisfying $\sum_{m=1}^{M} c_m = 1$, and $\mu_m$ and $\Sigma_m$ are the mean and covariance of the m-th Gaussian. $N(\cdot)$ is the normal density, used to compute the likelihood of the voiceprint feature vector $X_t$ at time t on a single Gaussian component:

$$N(X_t;\mu_m,\Sigma_m) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma_m|}}\, e^{-\frac{1}{2}(X_t-\mu_m)'\,\Sigma_m^{-1}\,(X_t-\mu_m)} \qquad (2)$$
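Equations (1) and (2) can be sketched directly in code. This is an illustrative NumPy implementation under the assumption of diagonal covariances (common for UBMs, though the patent does not mandate it); the function names are invented for the example.

```python
import numpy as np

def gaussian_pdf(X, mu, var):
    """Eq. (2) for a diagonal covariance: N(X_t; mu_m, Sigma_m),
    vectorised over the T frames of X (shape (T, n))."""
    n = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.sum(diff ** 2 / var, axis=1)       # Mahalanobis term
    norm = np.sqrt((2 * np.pi) ** n * np.prod(var))     # (2*pi)^n |Sigma|, square-rooted
    return np.exp(expo) / norm                          # shape (T,)

def ubm_likelihood(X, weights, means, variances):
    """Eq. (1): time-averaged mixture likelihood of the sequence X under the UBM."""
    per_frame = sum(c * gaussian_pdf(X, mu, var)
                    for c, mu, var in zip(weights, means, variances))
    return float(per_frame.mean())
```

In practice such products of densities are computed in the log domain to avoid underflow; the raw form is kept here to mirror equations (1) and (2).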
Step 104: compute the likelihood of the voiceprint feature sequence with respect to the speaker model of the current login user, the speaker model being a multi-GMM model constructed according to the repetition count and frame lengths of the registration speech signals recorded when the current login user registered.
Because the speaker model is a multi-GMM model built from the speech signals recorded at registration, computing the likelihood of the voiceprint feature sequence with respect to it requires first computing the likelihood of each voiceprint feature in the sequence against each of the component GMMs, and then determining the sequence's likelihood with respect to the speaker model from all the computed likelihoods. This can be implemented in several ways, for example:
1. First compute the likelihood of the voiceprint feature sequence with respect to each component GMM, then determine the likelihood with respect to the speaker model from those results.
In this approach, the likelihood of each voiceprint feature in the sequence against each component GMM is computed; the time average of the summed likelihoods of the whole group of voiceprint features against one GMM is taken as the likelihood of the sequence with respect to that GMM.
Having obtained the sequence's likelihood with respect to each component GMM, either the maximum or the average of these is taken as the likelihood of the sequence with respect to the speaker model.
2. First compute the likelihood of each voiceprint feature in the sequence with respect to the multi-GMM model, then determine the sequence's likelihood with respect to the speaker model from those results.
In this approach, the likelihood of each voiceprint feature against each component GMM is computed; for each voiceprint feature, either the maximum of its likelihoods against all component GMMs, or their mean, is taken as that feature's likelihood with respect to the multi-GMM model.
Having obtained each feature's likelihood with respect to the multi-GMM model, the time average of the summed likelihoods of all features in the sequence is taken as the likelihood of the sequence with respect to the speaker model.
Of course, other selection schemes are possible, such as a weighted average of all computed likelihoods; embodiments of the invention place no restriction on this.
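The two combination options above differ only in the order of the frame average and the over-model reduction. A small sketch, assuming per-frame likelihoods have already been computed into a matrix (the function and parameter names are illustrative):

```python
import numpy as np

def combine_scores(L, frame_first=False, reduce_fn="max"):
    """L[t, k]: likelihood of the t-th voiceprint feature under the k-th
    component GMM of the speaker model.
    frame_first=False -> option 1: average over frames per GMM,
                         then reduce (max or mean) over the GMMs.
    frame_first=True  -> option 2: reduce over the GMMs per frame,
                         then average over frames."""
    red = np.max if reduce_fn == "max" else np.mean
    if frame_first:
        return float(np.mean(red(L, axis=1)))
    return float(red(np.mean(L, axis=0)))
```

With the mean reduction the two orders coincide; with the max they generally do not, which is why the text lists them as distinct implementations.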
Step 105: compute the likelihood ratio from the likelihood of the voiceprint feature sequence with respect to the speaker model and its likelihood with respect to the background model.
The likelihood ratio is

$$p = \frac{p(X\mid U)}{p(X\mid \mathrm{UBM})} \qquad (3)$$

where $p(X\mid U)$ is the likelihood of the voiceprint feature sequence with respect to the speaker model, and $p(X\mid \mathrm{UBM})$ is its likelihood with respect to the background model.
Step 106: judge whether the likelihood ratio is greater than a preset threshold; if so, execute step 107; otherwise, execute step 108.
The threshold can be preset by the system. In general, the larger the threshold, the more sensitive the system: the user must pronounce the speech signal (password) at login as closely as possible to the pronunciation at registration. A smaller threshold makes the system less sensitive, tolerating some variation between the login pronunciation and the registration pronunciation.
Step 107: determine that the current login user is a validly authenticated user.
Step 108: determine that the current login user is a non-authenticated user.
It should be noted that, to improve system robustness, the continuous speech signal can also be denoised between steps 101 and 102. For example, the continuous speech signal is first segmented into independent speech segments and non-speech segments by short-time energy and short-time zero-crossing-rate analysis; front-end noise reduction then suppresses channel noise and background noise, improving the signal-to-noise ratio and providing a clean signal for subsequent processing.
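The short-time energy and zero-crossing-rate analysis mentioned above can be sketched as follows; the thresholds and the simple frame-level decision rule are assumptions for illustration (the patent leaves the segmentation method unspecified beyond naming the two features).

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples within one analysis frame."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def short_time_zcr(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    s = np.sign(np.asarray(frame, dtype=float))
    return float(np.mean(s[:-1] * s[1:] < 0))

def is_speech(frame, energy_thresh, zcr_thresh):
    """Crude frame-level speech/non-speech decision: voiced speech tends to
    have high energy, while silence/noise has low energy and high ZCR."""
    return (short_time_energy(frame) > energy_thresh
            and short_time_zcr(frame) < zcr_thresh)
```

A practical segmenter would smooth these per-frame decisions over time before cutting the signal into speech and non-speech segments.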
A user's voiceprint features are relatively stable yet variable: they are easily affected by physical condition, age, and mood on the one hand, and by environmental noise and the speech-acquisition channel on the other. The speaker model therefore needs to distinguish well between the different voiceprint variations of the same speaker. In embodiments of the invention, the speaker model is a multi-GMM model built from the speech signals recorded at registration: the number of component GMMs and the number of Gaussians in each are tied, respectively, to the repetition count of the recorded speech signal and its frame length. Multiple GMMs can thus simulate the pronunciation variations that occur when the user repeats the same password (i.e., the speech signal above), improving the accuracy of identity authentication based on a voiceprint password.
In embodiments of the invention, the background model describes the shared characteristics of speakers' voiceprints and needs to be built in advance, for which prior-art approaches can be used — for example, simulating the background model with a GMM of 1024 or more Gaussians, whose parameter training process is shown in Fig. 2.
Step 201: extract voiceprint features from the training speech signals of many speakers, each voiceprint feature serving as one feature vector.
Step 202: cluster the feature vectors with a clustering algorithm to obtain initial means for K Gaussians, K being the preset number of mixture components.
For example, the traditional LBG (Linde, Buzo, Gray) clustering algorithm can be used, which approaches an optimal reproduction codebook through the training vector set and an iterative algorithm.
Step 203: iteratively update the means, the variances, and each Gaussian's weight coefficient with the EM (Expectation Maximization) algorithm to obtain the background model.
The concrete iterative update process is the same as in the prior art and is not detailed here.
Of course, the background model can also be built in other ways; embodiments of the invention place no restriction on this.
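For concreteness, one EM iteration of step 203 can be sketched for a diagonal-covariance GMM; this is a generic textbook EM update, not the patent's specific implementation, and the function name and array layout are assumptions.

```python
import numpy as np

def em_step(X, weights, means, variances):
    """One EM update of a diagonal-covariance GMM.
    X: (T, d) features; weights: (M,); means, variances: (M, d)."""
    T, d = X.shape
    M = len(weights)
    # E-step: responsibilities gamma[t, m], computed in the log domain
    log_p = np.empty((T, M))
    for m in range(M):
        diff = X - means[m]
        log_p[:, m] = (np.log(weights[m])
                       - 0.5 * np.sum(np.log(2 * np.pi * variances[m]))
                       - 0.5 * np.sum(diff ** 2 / variances[m], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from soft counts
    Nm = gamma.sum(axis=0)
    new_w = Nm / T
    new_mu = (gamma.T @ X) / Nm[:, None]
    new_var = np.stack([(gamma[:, m:m + 1] * (X - new_mu[m]) ** 2).sum(0) / Nm[m]
                        for m in range(M)])
    return new_w, new_mu, new_var
```

A 1024-component UBM is simply this update run for several iterations over the pooled multi-speaker features, starting from the LBG cluster centroids of step 202.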
In embodiments of the invention, the system needs to distinguish whether the user is in login mode or registration mode. In login mode, the user is authenticated against the voiceprint password according to the flow of Fig. 1; in registration mode, the registration speech signal recorded by the user is received and the user's speaker model is built from it.
The speaker-model building process in embodiments of the invention is entirely different from that of the traditional speaker model. To illustrate this better, the traditional building process is first described briefly.
Traditionally, the speaker model is built by taking the background model as the initial model and adjusting part of its parameters with an adaptive method, such as the now most commonly used MAP-based adaptation algorithm. The adaptation algorithm adapts the shared voiceprint characteristics of the background model toward the individuality of the current speaker using a small amount of speaker data. Its training flow, shown in Fig. 3, comprises the following steps:
Step 301: extract voiceprint features from the registration speech signal recorded by the user.
Step 302: adaptively update the Gaussian means $\mu_m$ of the background model with the extracted voiceprint features.

Specifically, the new Gaussian mean $\hat{\mu}_m$ is computed as a weighted average of the sample statistics and the original Gaussian mean:

$$\hat{\mu}_m = \frac{\sum_{t=1}^{T}\gamma_m(x_t)\,x_t + \tau\,\mu_m}{\sum_{t=1}^{T}\gamma_m(x_t) + \tau} \qquad (4)$$

where $x_t$ denotes the voiceprint feature of frame t, $\gamma_m(x_t)$ the probability that frame t falls on the m-th Gaussian, and $\tau$ a forgetting factor balancing the historical mean against the sample update. In general, the larger $\tau$, the more the new mean is constrained by the original mean; the smaller $\tau$, the more the new mean is determined by the sample statistics, reflecting the distribution of the new samples.
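Equation (4) translates directly into code. A minimal sketch for one Gaussian, with the posteriors assumed precomputed (the function name and default value of tau are illustrative, not from the patent):

```python
import numpy as np

def map_update_mean(frames, gamma_m, mu_m, tau=16.0):
    """Eq. (4): MAP re-estimation of one Gaussian mean.
    frames: (T, d) adaptation features; gamma_m: (T,) posterior of each
    frame on Gaussian m; mu_m: (d,) prior mean; tau: forgetting factor."""
    num = (gamma_m[:, None] * frames).sum(axis=0) + tau * mu_m
    den = gamma_m.sum() + tau
    return num / den
```

As tau grows, the returned mean approaches the prior mu_m; as tau shrinks, it approaches the posterior-weighted sample mean, exactly the trade-off the text describes.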
Step 303: copy the background model variance as the variance of the user's speaker model.
Step 304: generate the user's speaker model.
In embodiments of the invention, the registration speech signal recorded by the user is received at registration, and the user's speaker model is built from it. This speaker model consists of multiple GMMs, simulating the pronunciation variations that occur when the speaker repeats the same password. Moreover, each component GMM has its own separately trained variance, which solves the problem in the conventional method that directly copying the background model variance yields an excessive variance unsuited to practical use.
Fig. 4 is a flowchart of building the speaker model in an embodiment of the invention, comprising the following steps:
Step 401: save each registration speech signal recorded by the user as a sample sequence.
Assuming the user enters the same password content N times at registration (e.g., N = 2, 3, ...), N independent sample sequences are obtained.
Step 402: extract voiceprint features from the obtained sample sequences.
The concrete process is similar to step 102 above and is not detailed here.
Step 403: determine all component GMMs of the user's speaker model according to the repetition count and frame lengths of the registration speech signals.
In voiceprint password applications, the user speaks a single fixed text as the password. For example, the number of component GMMs of the user's speaker model can be set equal to the repetition count of the registration speech signal, and the number of Gaussians of each component GMM set equal to the frame count of its corresponding registration recording, expressed as

$$p(O\mid M_k) = \sum_{m=1}^{T(k)} c_m^k\, N(O;\mu_m^k,\Sigma_m^k) \qquad (5)$$

where $T(k)$ is the number of Gaussians of the GMM $M_k$, equal to the frame count of the k-th speech sample corresponding to the model, and $c_m^k$, $\mu_m^k$, $\Sigma_m^k$ are respectively the weight, mean, and variance of the m-th Gaussian component of $M_k$.
Of course, embodiments of the invention do not restrict the topology of the speaker model: the number of component GMMs and the number of Gaussians per GMM need not exactly equal the repetition count and frame counts of the registration speech signals. For example, a clustering algorithm can be used to choose fewer GMMs than the repetition count of the registration speech signal, and likewise each GMM can have fewer Gaussians than its frame count.
Step 404: estimate the Gaussian mean parameters of all component GMMs from the extracted voiceprint features.
In embodiments of the invention, the Gaussian mean parameters of each component GMM are determined from its corresponding single training sample. Specifically, each Gaussian mean vector of the GMM is set to the feature vector of the sample, i.e.

$$\mu_m^k = O_m^k$$

where $\mu_m^k$ denotes the mean of the m-th Gaussian of the k-th mixture model, and $O_m^k$ the voiceprint feature vector of the m-th frame of the k-th speech signal.
Step 405: estimate the Gaussian variance parameters of all component GMMs from the extracted voiceprint features.
To make variance re-estimation feasible on little data, the Gaussians within each component GMM of the speaker model can be assumed to share a single global covariance matrix, i.e. $\Sigma_m^k = \Sigma_k$ for all m (the covariance matrices of all Gaussian components of the k-th GMM have identical values). Specifically, for a given sample voiceprint feature sequence $O_k$, the variance of the GMM $M_k$ is re-estimated from the statistics of all the remaining sample voiceprint feature sequences, computed as

$$\Sigma_k = \frac{\sum_{n\neq k}\sum_{i=1}^{T(n)}\sum_{m=1}^{T(k)} \gamma_m^k(O_i^n)\,(O_i^n-\mu_m^k)(O_i^n-\mu_m^k)^T}{\sum_{n\neq k}\sum_{i=1}^{T(n)}\sum_{m=1}^{T(k)} \gamma_m^k(O_i^n)} \qquad (6)$$

where $O_i^n$ denotes the i-th speech frame (i.e., sample) of the n-th password utterance (i.e., registration speech signal), $\mu_m^k$ the m-th Gaussian mean of the k-th GMM, and $\gamma_m^k(O_i^n)$ the probability of the sample $O_i^n$ falling on the Gaussian with mean $\mu_m^k$.

In this way, for each individual GMM $M_k$ of the speaker model, the corresponding variance parameter is obtained from the sample data other than $O_k$. If there are N registration utterances, N different variance matrices are obtained.

In particular, this variance matrix can be assumed to be diagonal to further reduce data sparsity, i.e. $\Sigma_k = \mathrm{diag}(\sigma_k^2)$. Going further, the Gaussians of all component GMMs of the speaker model can be assumed to share one global unified diagonal covariance, to better solve variance re-estimation under sparse data; under this hypothesis, a single diagonal covariance estimated from all registration data is used for every Gaussian of the speaker model.
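The leave-one-out variance of equation (6), under the shared diagonal assumption, can be sketched as follows. One simplification here is an assumption of this example, not of the patent: the posterior gamma is approximated by a hard assignment to the nearest Gaussian (a soft gamma would itself require a variance, creating a chicken-and-egg that the patent does not spell out).

```python
import numpy as np

def loo_shared_variance(utterances, k):
    """Eq. (6) with a shared diagonal variance for the k-th component GMM,
    whose Gaussian means are the frames of utterance k (step 404).
    Estimated from every registration utterance except the k-th.
    Hard-assignment approximation: gamma_m^k puts probability 1 on the
    nearest Gaussian mean."""
    means = np.asarray(utterances[k], dtype=float)   # (T_k, d): one mean per frame
    num = np.zeros(means.shape[1])
    den = 0.0
    for n, O_n in enumerate(utterances):
        if n == k:
            continue                                 # leave utterance k out
        for o in np.asarray(O_n, dtype=float):       # frame O_i^n
            m = int(np.argmin(np.sum((means - o) ** 2, axis=1)))
            num += (o - means[m]) ** 2               # diagonal of the outer product
            den += 1.0
    return num / den
```

With N registration utterances this yields N diagonal variance vectors, one per component GMM, matching the text; averaging them would give the single globally shared variance of the final hypothesis.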
Step 406: estimate the Gaussian weight coefficients of all the Gaussian mixture models.
In this embodiment, the Gaussian means of each mixture model are determined directly by the sample vectors, so each Gaussian occurs on its own sample with probability 1; that is, all Gaussians have the same probability of occurrence. The weight coefficients of the Gaussians within a mixture model can therefore be set equal:

    c_m^k = c^k = 1 / T(k)    (7)
With the flow shown in Fig. 4, the number of Gaussian mixture models in the speaker model and the topology of each model can be set according to the number of sentences and the length of each sentence in the registration voice. By reasonably setting the Gaussian means, variances, and weight coefficients of all the mixture models, the data-sparsity problem in training that exists in traditional voiceprint-password authentication systems is effectively solved, the discrimination between the mixture models is improved, and the accuracy of identity authentication can therefore be improved. Moreover, the resulting mixture models are smaller and more efficient, which greatly improves computation speed and reduces the memory required for storing data compared with the prior art.
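The model construction just summarized — one mixture model per registration sentence, one Gaussian per frame, a shared variance, and the uniform weights of Eq. (7) — can be sketched as follows (a toy illustration; all names and the dictionary representation are assumptions, and the shared variance is taken as given, e.g. from the Step-405 estimate):

```python
import numpy as np

def build_speaker_model(sentences, var):
    """Sketch of the sample-based speaker model described above.

    sentences : list of (T_k, D) arrays, one per password repetition
    var       : (D,) shared diagonal variance (assumed precomputed)
    """
    model = []
    for O_k in sentences:
        T_k = len(O_k)
        model.append({
            "means": np.asarray(O_k),           # one Gaussian per frame
            "var": np.asarray(var),             # globally tied diagonal variance
            "weights": np.full(T_k, 1.0 / T_k)  # equal weights, Eq. (7)
        })
    return model
```

The number of mixture models thus equals the repetition count, and each model's Gaussian count equals its sentence's frame count, matching the topology constraints stated elsewhere in the text.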
Correspondingly, an embodiment of the invention further provides an identity authentication system. Fig. 5 is a schematic structural diagram of an identity authentication system according to an embodiment of the invention.
In this embodiment, the system comprises:
a voice signal receiving unit 501, configured to receive, when a user logs in, the continuous speech signal recorded by the current login user;
an extraction unit 502, configured to extract a voiceprint feature sequence from the continuous speech signal;
a first computing unit 503, configured to compute the likelihood between the voiceprint feature sequence and a background model;
a second computing unit 504, configured to compute the likelihood between the voiceprint feature sequence and the speaker model of the current login user, where the speaker model is a multi-mixture Gaussian model constructed according to the repetition count and frame counts of the registration voice signals recorded when the current login user registered;
a third computing unit 505, configured to compute a likelihood ratio from the likelihood between the voiceprint feature sequence and the speaker model and the likelihood between the voiceprint feature sequence and the background model; and
a judging unit 506, configured to determine that the current login user is a validly authenticated user when the likelihood ratio computed by the third computing unit 505 is greater than a preset threshold, and otherwise determine that the current login user is a non-authenticated user.
The voiceprint feature sequence comprises a group of voiceprint features that can effectively distinguish different speakers while remaining relatively stable against variation within the same speaker's voice.
For example, the voiceprint features that the extraction unit 502 can extract mainly include: spectral envelope parameters, pitch contour, formant frequency and bandwidth features, linear prediction coefficients, cepstral coefficients, and so on. Considering the quantifiability of these voiceprint features, the amount of training data, and the evaluation of system performance, MFCC (Mel Frequency Cepstral Coefficient) features can be selected: short-time analysis is performed on each frame of speech data using a 25 ms window shifted by 10 ms, yielding the MFCC parameters and their first- and second-order differences, 39 dimensions in total. In this way, each voice signal can be quantized into a sequence X of 39-dimensional voiceprint feature vectors.
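For illustration, the 25 ms / 10 ms short-time framing mentioned above can be sketched as follows (a minimal numpy sketch; the full MFCC computation — mel filterbank, DCT, and the first- and second-order differences that complete the 39 dimensions — is omitted, and the function name is an assumption):

```python
import numpy as np

def frame_signal(x, sr, win_ms=25, hop_ms=10):
    """Split a speech signal into the 25 ms windows, shifted by 10 ms,
    used for MFCC short-time analysis."""
    win = int(sr * win_ms / 1000)            # e.g. 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)            # e.g. 160 samples at 16 kHz
    n = 1 + max(0, (len(x) - win) // hop)    # number of complete frames
    return np.stack([x[i * hop:i * hop + win] for i in range(n)])
```

At a 16 kHz sampling rate, one second of speech yields 98 frames of 400 samples each; each frame would then be mapped to one 39-dimensional voiceprint vector.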
The background model can be constructed in advance by the system and loaded at initialization; the embodiment of the invention places no restriction on the specific construction process of the background model.
The speaker model is a multi-mixture Gaussian model constructed from the voice signals recorded when the current login user registered. Correspondingly, in the embodiments of the invention, the second computing unit 504 can be implemented in multiple ways, for example:
In one implementation, the second computing unit 504 comprises a first computation subunit and a first determination subunit, where:
the first computation subunit is configured to compute the likelihood between the voiceprint feature sequence and each mixture model respectively; and
the first determination subunit is configured to determine, from the computation results of the first computation subunit, the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
The first computation subunit may comprise a first computing module and a first selection module, where:
the first computing module is configured to compute the likelihood between each voiceprint feature in the voiceprint feature sequence and each mixture model in the multi-mixture Gaussian model respectively; and
the first selection module is configured to take, for each mixture model, the time average of the sum of the likelihoods computed for the group of voiceprint features against that mixture model as the likelihood between the voiceprint feature sequence and that mixture model.
Correspondingly, the first determination subunit can also be implemented in multiple ways. For example, after the first computation subunit obtains the likelihood between the voiceprint feature sequence and each mixture model, the first determination subunit can select the maximum or the average of these likelihoods as the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
In another implementation, the second computing unit 504 comprises a second computation subunit and a second determination subunit, where:
the second computation subunit is configured to compute the likelihood of each voiceprint feature in the voiceprint feature sequence with respect to the multi-mixture Gaussian model respectively; and
the second determination subunit is configured to determine, from the computation results of the second computation subunit, the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
The second computation subunit may comprise a second computing module and a second selection module, where:
the second computing module is configured to compute the likelihood between each voiceprint feature in the voiceprint feature sequence and each mixture model in the multi-mixture Gaussian model respectively; and
the second selection module is configured to take, for each voiceprint feature, either the maximum of the likelihoods computed against all the mixture models in the multi-mixture Gaussian model, or the average of those likelihoods, as the likelihood between that voiceprint feature and the multi-mixture Gaussian model.
Correspondingly, the second determination subunit can also be implemented in multiple ways. For example, after the second computation subunit obtains the likelihood of each voiceprint feature with respect to the multi-mixture Gaussian model, the second determination subunit can take the time average of these per-feature likelihoods as the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
Of course, the second computing unit 504 can also be implemented in other ways; the embodiment of the invention places no restriction on this.
For the specific computation processes of the first computing unit 503, the second computing unit 504, and the third computing unit 505, reference can be made to the description in the identity authentication method embodiments above; they are not repeated here.
In the embodiments of the invention, the speaker model is a multi-mixture Gaussian model constructed from the voice signals recorded when the current login user registered: the number of mixture models and the number of Gaussians in each mixture model are related to the repetition count of the recorded voice signals and the frame counts of those signals. A plurality of mixture models can thus be used to model the pronunciation variation that exists when the user says the same password (i.e., the above voice signal) several times, improving the accuracy of voiceprint-password-based identity authentication.
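A toy numpy sketch of the scoring described above — per-frame log-likelihoods against each mixture model of the speaker model, the "maximum" selection variant, time averaging, and the likelihood-ratio decision against the background model (all names, and the diagonal-covariance dictionary representation of a GMM, are assumptions for illustration):

```python
import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_gmm(x, gmm):
    """Log-likelihood of one voiceprint feature under one diagonal-covariance GMM."""
    d = x - gmm["means"]                               # (M, D) differences
    log_n = -0.5 * (np.sum(d * d / gmm["var"], axis=1)
                    + np.sum(np.log(2 * np.pi * gmm["var"])))
    return logsumexp(np.log(gmm["weights"]) + log_n)   # mix over Gaussians

def verify(X, speaker_gmms, ubm, threshold):
    """Likelihood-ratio test: best mixture model per frame, averaged over
    time, minus the background-model score, compared with the threshold."""
    spk = np.mean([max(log_gmm(x, g) for g in speaker_gmms) for x in X])
    bg = np.mean([log_gmm(x, ubm) for x in X])
    return spk - bg > threshold
```

Swapping `max` for a mean over the mixture models gives the "average" selection variant also described in the text.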
Fig. 6 is another schematic structural diagram of an identity authentication system according to an embodiment of the invention.
Different from the embodiment shown in Fig. 5, in this embodiment the voice signal receiving unit 501 is also configured to receive, when a user registers, the registration voice signals recorded by the user.
In addition, the system further comprises a model construction unit 601, configured to construct the user's speaker model from the registration voice signals. The model construction unit 601 comprises:
a feature extraction subunit 611, configured to extract voiceprint features from the registration voice signals;
a topology determination subunit 612, configured to determine all the mixture models of the user's speaker model according to the repetition count and frame counts of the registration voice signals;
for example, the number of mixture models in the user's speaker model can be set to be less than or equal to the repetition count of the registration voice signals, and the number of Gaussians in each mixture model can be set to be less than or equal to the frame count of the corresponding registration voice signal;
a first estimation subunit 613, configured to estimate, using the voiceprint features extracted by the feature extraction subunit 611, the Gaussian mean parameters of all the mixture models determined by the topology determination subunit 612; and
a second estimation subunit 614, configured to estimate, using the voiceprint features extracted by the feature extraction subunit 611, the Gaussian variance parameters of all the mixture models determined by the topology determination subunit 612.
For the estimation of the relevant parameters of the mixture models by each estimation subunit, reference can be made to the description above; it is not repeated here.
With the identity authentication system of the embodiment of the invention, the number of Gaussian mixture models in the speaker model and the topology of each model can be set according to the number of sentences and the length of each sentence in the registration voice. By reasonably setting the Gaussian means, variances, and weight coefficients of all the mixture models, the data-sparsity problem in training that exists in traditional voiceprint-password authentication systems is effectively solved, the discrimination between the mixture models is improved, and the accuracy of identity authentication can therefore be improved. Moreover, the resulting mixture models are smaller and more efficient, which greatly improves computation speed and reduces the memory required for storing data compared with the prior art.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments. The system embodiments described above are merely schematic, and the units and modules described as separate components may or may not be physically separate. Some or all of the units and modules can be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above discloses only preferred embodiments of the invention, but the invention is not limited thereto. Any variation that a person skilled in the art can conceive without creative effort, and any improvement or modification made without departing from the principle of the invention, shall fall within the protection scope of the invention.

Claims (18)

1. An identity authentication method, characterized by comprising:
when a user logs in, receiving a continuous speech signal recorded by the current login user;
extracting a voiceprint feature sequence from the continuous speech signal, the voiceprint feature sequence comprising a group of voiceprint features;
computing the likelihood between the voiceprint feature sequence and a background model;
computing the likelihood between the voiceprint feature sequence and a speaker model of the current login user, the speaker model being a multi-mixture Gaussian model constructed according to the repetition count and frame counts of the registration voice signals recorded when the current login user registered;
computing a likelihood ratio from the likelihood between the voiceprint feature sequence and the speaker model and the likelihood between the voiceprint feature sequence and the background model; and
if the likelihood ratio is greater than a preset threshold, determining that the current login user is a validly authenticated user; otherwise, determining that the current login user is a non-authenticated user.
2. The method according to claim 1, characterized in that computing the likelihood between the voiceprint feature sequence and the speaker model of the current login user comprises:
computing the likelihood between the voiceprint feature sequence and each mixture model respectively; and
determining the likelihood between the voiceprint feature sequence and the speaker model of the current login user according to the computation results.
3. The method according to claim 2, characterized in that computing the likelihood between the voiceprint feature sequence and each mixture model respectively comprises:
computing the likelihood between each voiceprint feature in the voiceprint feature sequence and each mixture model in the multi-mixture Gaussian model respectively; and
taking, for each mixture model, the time average of the sum of the likelihoods computed for the group of voiceprint features against that mixture model as the likelihood between the voiceprint feature sequence and that mixture model.
4. The method according to claim 2, characterized in that determining the likelihood between the voiceprint feature sequence and the speaker model of the current login user according to the computation results comprises:
taking the average of the likelihoods computed for the voiceprint feature sequence against all the mixture models as the likelihood between the voiceprint feature sequence and the speaker model of the current login user; or
taking the maximum of the likelihoods computed for the voiceprint feature sequence against all the mixture models as the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
5. The method according to claim 1, characterized in that computing the likelihood between the voiceprint feature sequence and the speaker model of the current login user comprises:
computing the likelihood of each voiceprint feature in the voiceprint feature sequence with respect to the multi-mixture Gaussian model respectively; and
determining the likelihood between the voiceprint feature sequence and the speaker model of the current login user according to the computation results.
6. The method according to claim 5, characterized in that computing the likelihood of each voiceprint feature in the voiceprint feature sequence with respect to the multi-mixture Gaussian model respectively comprises:
computing the likelihood between each voiceprint feature in the voiceprint feature sequence and each mixture model in the multi-mixture Gaussian model respectively; and
taking, for each voiceprint feature, the maximum of the likelihoods computed against all the mixture models in the multi-mixture Gaussian model as the likelihood between that voiceprint feature and the multi-mixture Gaussian model; or taking, for each voiceprint feature, the average of the likelihoods computed against all the mixture models in the multi-mixture Gaussian model as the likelihood between that voiceprint feature and the multi-mixture Gaussian model.
7. The method according to claim 5, characterized in that determining the likelihood between the voiceprint feature sequence and the speaker model of the current login user according to the computation results comprises:
taking the time average of the likelihoods of all the voiceprint features in the voiceprint feature sequence with respect to the multi-mixture Gaussian model as the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
8. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
when a user registers, receiving the registration voice signals recorded by the user; and
constructing the user's speaker model from the registration voice signals, the constructing comprising:
extracting voiceprint features from the registration voice signals;
determining all the mixture models of the user's speaker model according to the repetition count and frame counts of the registration voice signals;
estimating the Gaussian mean parameters of all the mixture models of the user's speaker model according to the voiceprint features extracted from the registration voice signals; and
estimating the Gaussian variance parameters of all the mixture models of the user's speaker model according to the voiceprint features extracted from the registration voice signals.
9. The method according to claim 8, characterized in that determining all the mixture models of the user's speaker model according to the repetition count and frame counts of the registration voice signals comprises:
setting the number of mixture models in the user's speaker model to be less than or equal to the repetition count of the registration voice signals; and
setting the number of Gaussians in each mixture model to be less than or equal to the frame count of the registration voice signal corresponding to that mixture model.
10. An identity authentication system, characterized by comprising:
a voice signal receiving unit, configured to receive, when a user logs in, a continuous speech signal recorded by the current login user;
an extraction unit, configured to extract a voiceprint feature sequence from the continuous speech signal, the voiceprint feature sequence comprising a group of voiceprint features;
a first computing unit, configured to compute the likelihood between the voiceprint feature sequence and a background model;
a second computing unit, configured to compute the likelihood between the voiceprint feature sequence and a speaker model of the current login user, the speaker model being a multi-mixture Gaussian model constructed according to the repetition count and frame counts of the registration voice signals recorded when the current login user registered;
a third computing unit, configured to compute a likelihood ratio from the likelihood between the voiceprint feature sequence and the speaker model and the likelihood between the voiceprint feature sequence and the background model; and
a judging unit, configured to determine that the current login user is a validly authenticated user when the likelihood ratio computed by the third computing unit is greater than a preset threshold, and otherwise determine that the current login user is a non-authenticated user.
11. The system according to claim 10, characterized in that the second computing unit comprises:
a first computation subunit, configured to compute the likelihood between the voiceprint feature sequence and each mixture model respectively; and
a first determination subunit, configured to determine, from the computation results of the first computation subunit, the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
12. The system according to claim 11, characterized in that the first computation subunit comprises:
a first computing module, configured to compute the likelihood between each voiceprint feature in the voiceprint feature sequence and each mixture model in the multi-mixture Gaussian model respectively; and
a first selection module, configured to take, for each mixture model, the time average of the sum of the likelihoods computed for the group of voiceprint features against that mixture model as the likelihood between the voiceprint feature sequence and that mixture model.
13. The system according to claim 11, characterized in that
the first determination subunit is specifically configured to take the average of the likelihoods computed for the voiceprint feature sequence against all the mixture models as the likelihood between the voiceprint feature sequence and the speaker model of the current login user; or to take the maximum of those likelihoods as the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
14. The system according to claim 10, characterized in that the second computing unit comprises:
a second computation subunit, configured to compute the likelihood of each voiceprint feature in the voiceprint feature sequence with respect to the multi-mixture Gaussian model respectively; and
a second determination subunit, configured to determine, from the computation results of the second computation subunit, the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
15. The system according to claim 14, characterized in that the second computation subunit comprises:
a second computing module, configured to compute the likelihood between each voiceprint feature in the voiceprint feature sequence and each mixture model in the multi-mixture Gaussian model respectively; and
a second selection module, configured to take, for each voiceprint feature, the maximum of the likelihoods computed against all the mixture models in the multi-mixture Gaussian model as the likelihood between that voiceprint feature and the multi-mixture Gaussian model; or to take, for each voiceprint feature, the average of those likelihoods as the likelihood between that voiceprint feature and the multi-mixture Gaussian model.
16. The system according to claim 14, characterized in that
the second determination subunit is specifically configured to take the time average of the likelihoods of the voiceprint features in the voiceprint feature sequence with respect to the multi-mixture Gaussian model as the likelihood between the voiceprint feature sequence and the speaker model of the current login user.
17. The system according to any one of claims 10 to 16, characterized in that
the voice signal receiving unit is also configured to receive, when a user registers, the registration voice signals recorded by the user; and
the system further comprises a model construction unit, configured to construct the user's speaker model from the registration voice signals, the model construction unit comprising:
a feature extraction subunit, configured to extract voiceprint features from the registration voice signals;
a topology determination subunit, configured to determine all the mixture models of the user's speaker model according to the repetition count and frame counts of the registration voice signals;
a first estimation subunit, configured to estimate, using the voiceprint features extracted by the feature extraction subunit, the Gaussian mean parameters of all the mixture models determined by the topology determination subunit; and
a second estimation subunit, configured to estimate, using the voiceprint features extracted by the feature extraction subunit, the Gaussian variance parameters of all the mixture models determined by the topology determination subunit.
18. The system according to claim 17, characterized in that
the topology determination subunit is specifically configured to set the number of mixture models in the user's speaker model to be less than or equal to the repetition count of the registration voice signals, and to set the number of Gaussians in each mixture model to be less than or equal to the frame count of the registration voice signal corresponding to that mixture model.
CN2011102180452A 2011-08-01 2011-08-01 Identity authentication method and system Active CN102238190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102180452A CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system


Publications (2)

Publication Number Publication Date
CN102238190A true CN102238190A (en) 2011-11-09
CN102238190B CN102238190B (en) 2013-12-11

Family

ID=44888395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102180452A Active CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system

Country Status (1)

Country Link
CN (1) CN102238190B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102496365A (en) * 2011-11-30 2012-06-13 上海博泰悦臻电子设备制造有限公司 User verification method and device
CN102510426A (en) * 2011-11-29 2012-06-20 安徽科大讯飞信息科技股份有限公司 Personal assistant application access method and system
CN102710602A (en) * 2012-04-28 2012-10-03 深圳创维-Rgb电子有限公司 Voice login method and system for electronic equipment, and television
CN102968990A (en) * 2012-11-15 2013-03-13 江苏嘉利德电子科技有限公司 Speaker identifying method and system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN104160441A (en) * 2011-12-29 2014-11-19 罗伯特·博世有限公司 Speaker verification in a health monitoring system
CN104239471A (en) * 2014-09-03 2014-12-24 陈飞 Data query/ exchange device in behavior simulation mode and method thereof
CN104361891A (en) * 2014-11-17 2015-02-18 科大讯飞股份有限公司 Method and system for automatically checking customized polyphonic ringtones of specific population
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN105096954A (en) * 2014-05-06 2015-11-25 中兴通讯股份有限公司 Identity identifying method and device
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106157135A (en) * 2016-07-14 2016-11-23 微额速达(上海)金融信息服务有限公司 Antifraud system and method based on Application on Voiceprint Recognition Sex, Age
CN106228990A (en) * 2016-07-15 2016-12-14 北京光年无限科技有限公司 Login method and operating system towards intelligent robot
CN107705791A (en) * 2016-08-08 2018-02-16 中国电信股份有限公司 Caller identity confirmation method, device and Voiceprint Recognition System based on Application on Voiceprint Recognition
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
WO2018166187A1 (en) * 2017-03-13 2018-09-20 平安科技(深圳)有限公司 Server, identity verification method and system, and a computer-readable storage medium
CN109102810A (en) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 Method for recognizing sound-groove and device
CN110223078A (en) * 2019-06-17 2019-09-10 国网电子商务有限公司 Identity authentication method, device, electronic equipment and storage medium
CN111023470A (en) * 2019-12-06 2020-04-17 厦门快商通科技股份有限公司 Air conditioner temperature adjusting method, medium, equipment and device
CN115171727A (en) * 2022-09-08 2022-10-11 北京亮亮视野科技有限公司 Method and device for quantifying communication efficiency

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060111905A1 (en) * 2004-11-22 2006-05-25 Jiri Navratil Method and apparatus for training a text independent speaker recognition system using speech data with text labels
US20080059156A1 (en) * 2006-08-30 2008-03-06 International Business Machines Corporation Method and apparatus for processing speech data
CN101833951A (en) * 2010-03-04 2010-09-15 清华大学 Multi-background modeling method for speaker recognition
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method


Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102510426A (en) * 2011-11-29 2012-06-20 Anhui USTC iFlytek Co., Ltd. Personal assistant application access method and system
CN102496365A (en) * 2011-11-30 2012-06-13 Shanghai Pateo Electronic Equipment Manufacturing Co., Ltd. User verification method and device
CN104160441A (en) * 2011-12-29 2014-11-19 Robert Bosch GmbH Speaker verification in a health monitoring system
CN104160441B (en) * 2011-12-29 2017-12-15 Robert Bosch GmbH Speaker verification in a health monitoring system
CN102710602A (en) * 2012-04-28 2012-10-03 Shenzhen Skyworth-RGB Electronic Co., Ltd. Voice login method and system for electronic equipment, and television
CN102968990B (en) * 2012-11-15 2015-04-15 Zhu Donglai Speaker identification method and system
CN102968990A (en) * 2012-11-15 2013-03-13 Jiangsu Jialide Electronic Technology Co., Ltd. Speaker identification method and system
CN103226951B (en) * 2013-04-19 2015-05-06 Tsinghua University Speaker verification system creation method based on model sequence adaptation technique
CN103226951A (en) * 2013-04-19 2013-07-31 Tsinghua University Speaker verification system creation method based on model sequence adaptation technique
CN105096954A (en) * 2014-05-06 2015-11-25 ZTE Corporation Identity identification method and device
CN104239471A (en) * 2014-09-03 2014-12-24 Chen Fei Data query/exchange device in behavior simulation mode and method thereof
CN104239471B (en) * 2014-09-03 2017-12-19 Chen Fei Data query/exchange device in behavior simulation mode and method thereof
CN104361891A (en) * 2014-11-17 2015-02-18 iFlytek Co., Ltd. Method and system for automatically checking customized polyphonic ringtones for a specific population
CN104766607A (en) * 2015-03-05 2015-07-08 Guangzhou Shiyuan Electronic Technology Co., Ltd. Television program recommendation method and system
CN106057206A (en) * 2016-06-01 2016-10-26 Tencent Technology (Shenzhen) Co., Ltd. Voiceprint model training method, voiceprint recognition method and device
CN106057206B (en) * 2016-06-01 2019-05-03 Tencent Technology (Shenzhen) Co., Ltd. Voiceprint model training method, voiceprint recognition method and device
CN106157135A (en) * 2016-07-14 2016-11-23 Wei'e Suda (Shanghai) Financial Information Service Co., Ltd. Anti-fraud system and method based on voiceprint recognition of gender and age
CN106228990A (en) * 2016-07-15 2016-12-14 Beijing Guangnian Wuxian Technology Co., Ltd. Login method and operating system for intelligent robots
CN107705791A (en) * 2016-08-08 2018-02-16 China Telecom Corporation Limited Caller identity confirmation method and device based on voiceprint recognition, and voiceprint recognition system
CN107767863A (en) * 2016-08-22 2018-03-06 iFlytek Co., Ltd. Voice wake-up method, system and intelligent terminal
WO2018166187A1 (en) * 2017-03-13 2018-09-20 Ping An Technology (Shenzhen) Co., Ltd. Server, identity verification method and system, and a computer-readable storage medium
CN109102810A (en) * 2017-06-21 2018-12-28 Beijing Sogou Technology Development Co., Ltd. Voiceprint recognition method and device
CN109102810B (en) * 2017-06-21 2021-10-15 Beijing Sogou Technology Development Co., Ltd. Voiceprint recognition method and device
CN110223078A (en) * 2019-06-17 2019-09-10 State Grid E-Commerce Co., Ltd. Identity authentication method, device, electronic equipment and storage medium
CN111023470A (en) * 2019-12-06 2020-04-17 Xiamen Kuaishangtong Technology Co., Ltd. Air conditioner temperature adjustment method, medium, equipment and device
CN115171727A (en) * 2022-09-08 2022-10-11 Beijing LLVision Technology Co., Ltd. Method and device for quantifying communication efficiency

Also Published As

Publication number Publication date
CN102238190B (en) 2013-12-11

Similar Documents

Publication Publication Date Title
CN102238190B (en) Identity authentication method and system
CN102238189B (en) Voiceprint password authentication method and system
EP1989701B1 (en) Speaker authentication
CN108417201B (en) Single-channel multi-speaker identity recognition method and system
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization
JP6303971B2 (en) Speaker change detection device, speaker change detection method, and computer program for speaker change detection
Chavan et al. An overview of speech recognition using HMM
CN104900235B (en) Voiceprint recognition method based on pitch-period composite characteristic parameters
CN102142253B (en) Voice emotion identification equipment and method
CN108281137A (en) Universal voice wake-up recognition method and system under a whole-phoneme framework
US20080065380A1 (en) On-line speaker recognition method and apparatus thereof
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
CN104835498A (en) Voiceprint identification method based on multi-type combination characteristic parameters
CN102270451A (en) Method and system for identifying speaker
WO2014114048A1 (en) Voice recognition method and apparatus
CN104765996B (en) Voiceprint password authentication method and system
CN102324232A (en) Voiceprint recognition method and system based on Gaussian mixture models
CN102024455A (en) Speaker recognition system and method
CN101923855A (en) Text-independent voiceprint recognition system
CN102184654A (en) Reading supervision method and device
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
CN104901807A (en) Voiceprint password method suitable for low-end chips
Beritelli et al. The role of voice activity detection in forensic speaker verification
Trabelsi et al. A multi level data fusion approach for speaker identification on telephone speech
Kockmann et al. Recent progress in prosodic speaker verification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: IFLYTEK CO., LTD.

Free format text: FORMER NAME: ANHUI USTC IFLYTEK CO., LTD.

CP03 Change of name, title or address

Address after: No. 666 Wangjiang West Road, Hefei High-tech Development Zone, Anhui, China (230088)

Patentee after: iFlytek Co., Ltd.

Address before: No. 616 Huangshan Road, High-tech Development Zone, Hefei, Anhui, China (230088)

Patentee before: Anhui USTC iFLYTEK Co., Ltd.