CN102238190B - Identity authentication method and system - Google Patents

Identity authentication method and system

Info

Publication number
CN102238190B
CN102238190B (application number CN2011102180452A / CN201110218045A)
Authority
CN
China
Prior art keywords
vocal print
model
characteristic sequence
likelihood score
print characteristic
Prior art date
Legal status (assumed, not a legal conclusion)
Active
Application number
CN2011102180452A
Other languages
Chinese (zh)
Other versions
CN102238190A (en)
Inventor
潘逸倩
胡国平
何婷婷
魏思
胡郁
王智国
刘庆峰
Current Assignee (the listed assignee may be inaccurate)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (assumed, not a legal conclusion)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN2011102180452A priority Critical patent/CN102238190B/en
Publication of CN102238190A publication Critical patent/CN102238190A/en
Application granted granted Critical
Publication of CN102238190B publication Critical patent/CN102238190B/en


Abstract

The invention discloses an identity authentication method and system. The method comprises the following steps: when a user logs in, receiving a continuous speech signal recorded by the current user; extracting a voiceprint feature sequence from the continuous speech signal; computing the likelihood of the voiceprint feature sequence against a background model; computing the likelihood of the voiceprint feature sequence against the speaker model of the current user, where the speaker model is a multi-GMM, i.e. a model composed of multiple Gaussian mixture models, built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the user registered; computing the likelihood ratio from the two likelihoods; and, if the likelihood ratio is greater than a preset threshold, determining that the current user is a validly authenticated user, and otherwise determining that the current user is an unauthenticated user. The method and system improve the accuracy of identity authentication based on voiceprint passwords.

Description

Identity authentication method and system
Technical field
The present invention relates to the field of identity authentication, and in particular to an identity authentication method and system.
Background
Voiceprint recognition (VPR), also known as speaker recognition, comes in two forms: speaker identification and speaker verification. The former decides which of several enrolled people uttered a given speech segment (a "one-from-many" selection problem); the latter confirms whether a given speech segment was uttered by a specified person (a "one-to-one" verification problem). Different tasks and applications call for different voiceprint recognition techniques.
Voiceprint authentication confirms a speaker's identity from a collected speech signal and belongs to the "one-to-one" verification class. Mainstream voiceprint authentication systems adopt a hypothesis-testing framework: they compute the likelihood of the voiceprint signal against a speaker model and against a background model, and compare the resulting likelihood ratio with a threshold set empirically in advance. The accuracy of the background model and the speaker model therefore directly affects authentication performance, and under a data-driven statistical modeling framework, more training data generally yields a better model.
Voiceprint password authentication is a text-dependent speaker authentication method. It requires the user to speak a fixed password text and confirms the speaker's identity accordingly. Because both enrollment and authentication use speech of the same fixed password text, the voiceprints tend to be more consistent, so this approach achieves better authentication performance than text-independent speaker verification.
The most mainstream technical approach in current voiceprint password authentication systems is the GMM-UBM algorithm, which models both the Universal Background Model (UBM) and the speaker model with Gaussian Mixture Models (GMM). The UBM describes what speakers' voiceprints have in common. Because each speaker's voiceprint also has its own specificity, a UBM trained on data from many speakers needs a complex model structure to fit the scattered distribution of the data; current UBMs typically use GMMs with 1024 or even more Gaussians.
The speaker model is trained online by the system from the enrollment speech when the user registers. Because enrollment speech samples are usually limited, directly training a complex model easily suffers from data sparsity and yields an inaccurate model. The prior art therefore normally starts from the background model as the initial model and adjusts part of its parameters from the small amount of speaker data via adaptive methods, such as the widely used Maximum A Posteriori (MAP) adaptation, thereby adapting the shared voiceprint characteristics toward the individual characteristics of the current speaker.
Under such adaptive-update algorithms there is a one-to-one correspondence between each Gaussian of the speaker's GMM and a Gaussian of the common background model. The speaker model therefore has too many parameters, which easily causes the following problems in voiceprint password authentication systems, where the amount of enrollment data is small:
1. Model redundancy: in a voiceprint password authentication system, the speaker model is trained from the few samples obtained by repeating the spoken password during enrollment. With so few samples, the adaptive algorithm can update only part of the Gaussians of the initial background model, leaving many components nearly identical to the background model. These redundant parameters increase storage and computation pressure and hence reduce decoding efficiency.
2. Heavy training cost: the adaptive algorithm must compute sample statistics and update parameters for every one of the 1024 or more Gaussians of the initial background model.
3. Because re-estimating the speaker model's variances is difficult in the adaptive algorithm, the variances of the background model are usually adopted directly. The background model simulates the shared characteristics of many speakers' voiceprints, so its distribution variances tend to be large, whereas the variances of a speaker model should reflect the specificity of that speaker's voiceprint. Copying the background variances fails to capture the speaker model's characteristics and reduces the discrimination between different speaker models, which harms recognition accuracy.
Summary of the invention
Embodiments of the present invention provide an identity authentication method and system to improve the accuracy of identity authentication based on voiceprint passwords.
In one aspect, an embodiment of the present invention provides an identity authentication method, comprising:
when a user logs in, receiving a continuous speech signal recorded by the current user;
extracting a voiceprint feature sequence from the continuous speech signal, the voiceprint feature sequence comprising a group of voiceprint features;
computing the likelihood of the voiceprint feature sequence against a background model;
computing the likelihood of the voiceprint feature sequence against the speaker model of the current user, the speaker model being a multi-GMM built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the current user registered;
computing the likelihood ratio from the likelihood of the voiceprint feature sequence against the speaker model and its likelihood against the background model; and
if the likelihood ratio is greater than a preset threshold, determining that the current user is a validly authenticated user; otherwise, determining that the current user is an unauthenticated user.
In another aspect, an embodiment of the present invention provides an identity authentication system, comprising:
a speech signal receiving unit for receiving, when a user logs in, a continuous speech signal recorded by the current user;
an extraction unit for extracting a voiceprint feature sequence, comprising a group of voiceprint features, from the continuous speech signal;
a first computing unit for computing the likelihood of the voiceprint feature sequence against a background model;
a second computing unit for computing the likelihood of the voiceprint feature sequence against the speaker model of the current user, the speaker model being a multi-GMM built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the current user registered;
a third computing unit for computing the likelihood ratio from the likelihood of the voiceprint feature sequence against the speaker model and its likelihood against the background model; and
a judging unit for determining that the current user is a validly authenticated user when the likelihood ratio computed by the third computing unit is greater than a preset threshold, and otherwise determining that the current user is an unauthenticated user.
In the identity authentication method and system provided by the embodiments of the present invention, the voiceprint feature sequence extracted from the continuous speech signal recorded by the current user is scored against both the user's speaker model and the background model, the likelihood ratio is computed, and the ratio determines whether the current user is validly authenticated. Because the speaker model used here is a multi-GMM built from the speech signals recorded at registration, it can model the pronunciation variations that occur when the user utters the same speech signal (i.e., the password) on different occasions, which improves the accuracy of identity authentication based on voiceprint passwords.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described here are obviously only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the identity authentication method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a background model parameter training process in an embodiment of the present invention;
Fig. 3 is a flowchart of building a speaker model with a traditional adaptive algorithm;
Fig. 4 is a flowchart of building a speaker model in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the identity authentication system according to an embodiment of the present invention;
Fig. 6 is another schematic structural diagram of the identity authentication system according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, the identity authentication method of an embodiment of the present invention comprises the following steps:
Step 101: when a user logs in, receive the continuous speech signal recorded by the current user.
Step 102: extract the voiceprint feature sequence from the continuous speech signal.
The voiceprint feature sequence comprises a group of voiceprint features that can effectively distinguish different speakers while remaining relatively stable across variations of the same speaker.
Typical voiceprint features include spectral envelope features, pitch contour, formant frequencies and bandwidths, linear prediction coefficients, cepstral coefficients, and so on. Considering the quantifiability of these features, the amount of training data, and the evaluation of system performance, MFCC (Mel Frequency Cepstral Coefficient) features may be selected: each frame of speech, with a 25 ms window shifted by 10 ms, is subjected to short-time analysis to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total. Each speech signal is thereby quantized into a sequence X of 39-dimensional voiceprint feature vectors.
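As a concrete illustration of this step, here is a minimal feature-extraction sketch using the librosa library. The patent does not prescribe an implementation, so the library choice, the 16 kHz sampling rate, and the 13 base cepstral coefficients are assumptions.

```python
import librosa
import numpy as np

def extract_voiceprint_features(wav_path, sr=16000):
    """39-dim voiceprint features: 13 MFCCs plus first- and second-order
    differences, computed over 25 ms windows shifted by 10 ms."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    d1 = librosa.feature.delta(mfcc)             # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)    # second-order differences
    return np.vstack([mfcc, d1, d2]).T           # sequence X, shape (T, 39)
```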
Step 103: compute the likelihood of the voiceprint feature sequence against the background model.
For a voiceprint feature vector sequence X of T frames, the likelihood against the background model (UBM) is:

p(X|\mathrm{UBM}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} c_m \, N(X_t; \mu_m, \Sigma_m)    (1)

where c_m is the weight of the m-th Gaussian, satisfying \sum_{m=1}^{M} c_m = 1, and \mu_m and \Sigma_m are respectively the mean and variance of the m-th Gaussian. N(\cdot) is the normal density, used to compute the likelihood of the voiceprint feature vector X_t at time t on a single Gaussian component:

N(X_t; \mu_m, \Sigma_m) = \frac{1}{\sqrt{(2\pi)^n \, |\Sigma_m|}} \, e^{-\frac{1}{2}(X_t - \mu_m)' \, \Sigma_m^{-1} \, (X_t - \mu_m)}    (2)
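The following numpy/scipy sketch is a direct transcription of equations (1) and (2); the function and variable names are assumptions, and a production system would work in the log domain to avoid numerical underflow.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ubm_likelihood(X, weights, means, covs):
    """Transcription of equations (1) and (2): per-frame mixture densities
    summed over the M components, then averaged over the T frames."""
    frame_lik = np.zeros(X.shape[0])
    for c_m, mu_m, cov_m in zip(weights, means, covs):
        # N(X_t; mu_m, Sigma_m) for all frames at once, equation (2)
        frame_lik += c_m * multivariate_normal.pdf(X, mean=mu_m, cov=cov_m)
    return frame_lik.mean()   # (1/T) sum_t sum_m c_m N(...), equation (1)
```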
Step 104: compute the likelihood of the voiceprint feature sequence against the speaker model of the current user, the speaker model being a multi-GMM built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the current user registered.
Because the speaker model is a multi-GMM built from the speech signals recorded at registration, this step requires computing the likelihood of each voiceprint feature of the sequence against each component GMM, and then determining the likelihood of the sequence against the speaker model from all the computed likelihoods. Several implementations are possible, for example:
1. First compute the likelihood of the voiceprint feature sequence against each component GMM, then determine the likelihood of the sequence against the speaker model from those results.
In this approach, the likelihood of each voiceprint feature of the sequence against each GMM of the multi-GMM is computed; the time-averaged sum of the likelihoods of the group of voiceprint features against one GMM is taken as the likelihood of the sequence against that GMM.
Then, after the likelihood of the sequence against each GMM has been obtained, either the maximum or the mean of these values may be selected as the likelihood of the sequence against the speaker model of the current user.
2. First compute the likelihood of each voiceprint feature of the sequence against the multi-GMM, then determine the likelihood of the sequence against the speaker model from those results.
In this approach, the likelihood of each voiceprint feature against each GMM of the multi-GMM is computed; for each voiceprint feature, either the maximum or the mean of its likelihoods against the GMMs is taken as that feature's likelihood against the multi-GMM.
Then the time-averaged sum of the likelihoods of all voiceprint features of the sequence is taken as the likelihood of the sequence against the speaker model of the current user.
Of course, other selection schemes are possible, such as a weighted average of all computed likelihoods; the embodiments of the present invention place no restriction on this.
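As an illustration, here is a sketch of the first strategy above: score the sequence against each component GMM with the time-averaged sum (reusing the equation-(1) scorer sketched earlier, since equation (5) below has the same mixture form), then combine the per-GMM scores by maximum or mean. The dictionary layout of the speaker model is an assumption.

```python
def speaker_model_likelihood(X, gmms, combine="max"):
    """Score sequence X against a multi-GMM speaker model (strategy 1):
    each component GMM gets the time-averaged sum of frame likelihoods,
    then the per-GMM scores are combined by maximum or mean."""
    per_gmm = [ubm_likelihood(X, g["weights"], g["means"], g["covs"])
               for g in gmms]  # reuses the scorer from the sketch above
    return max(per_gmm) if combine == "max" else sum(per_gmm) / len(per_gmm)
```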
Step 105: compute the likelihood ratio from the likelihood of the voiceprint feature sequence against the speaker model and its likelihood against the background model.
The likelihood ratio is:

p = \frac{p(X|U)}{p(X|\mathrm{UBM})}    (3)

where p(X|U) is the likelihood of the voiceprint feature sequence against the speaker model and p(X|UBM) is its likelihood against the background model.
Step 106: judge whether the likelihood ratio is greater than the preset threshold; if so, go to step 107; otherwise, go to step 108.
The threshold may be preset by the system. In general, the larger the threshold, the more sensitive the system: at login the user must pronounce the speech signal (password) as closely as possible to the way it was recorded at registration. A smaller threshold makes the system less sensitive, tolerating some variation between the pronunciation at login and the pronunciation at registration.
Step 107: determine that the current user is a validly authenticated user.
Step 108: determine that the current user is an unauthenticated user.
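Steps 105 to 108 combine into a few lines. This hedged sketch assumes the helper functions and model layout of the earlier sketches, with the threshold supplied by the system.

```python
def authenticate(X, speaker_gmms, ubm_params, threshold):
    """Steps 105-108: accept the login if the likelihood ratio of
    equation (3) exceeds the preset threshold."""
    p_spk = speaker_model_likelihood(X, speaker_gmms)  # p(X|U)
    p_ubm = ubm_likelihood(X, *ubm_params)             # p(X|UBM); params = (weights, means, covs)
    return (p_spk / p_ubm) > threshold                 # True: validly authenticated user
```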
It should be noted that, to improve system robustness, the continuous speech signal may be denoised before steps 101 and 102. For example, short-time energy and short-time zero-crossing-rate analysis may first segment the continuous signal into independent speech segments and non-speech segments; front-end noise reduction then suppresses channel noise and background noise, raising the signal-to-noise ratio and providing a clean signal for subsequent processing, as sketched below.
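A crude segmentation of the kind described might look as follows; the threshold values are illustrative assumptions and would be tuned on real data.

```python
import numpy as np

def simple_vad(y, sr, win=0.025, hop=0.010, energy_th=1e-2, zcr_th=0.3):
    """Crude speech/non-speech segmentation by short-time energy and
    zero-crossing rate. The thresholds here are assumptions."""
    n, h = int(win * sr), int(hop * sr)
    flags = []
    for start in range(0, len(y) - n, h):
        frame = y[start:start + n]
        energy = float(np.mean(frame ** 2))
        # zero-crossing rate: fraction of adjacent samples changing sign
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        flags.append(energy > energy_th and zcr < zcr_th)  # speech: loud, low ZCR
    return np.array(flags)  # True = speech frame
```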
A user's voiceprint features are relatively stable yet also variable: they are affected by health, age, and mood on the one hand, and by environmental noise and the recording channel on the other, so the speaker model must be able to accommodate the different voiceprint variations of the same speaker. In the embodiments of the present invention, the speaker model is a multi-GMM built from the speech signals recorded at registration; the number of component GMMs and the number of Gaussians in each GMM are tied to the number of repetitions of the enrollment speech and its frame counts. The multiple GMMs can thus model the pronunciation variations that occur when the user utters the same password (the speech signal above), improving the accuracy of voiceprint-password authentication.
In the embodiments of the present invention, the background model describes the shared characteristics of speakers' voiceprints and must be built in advance. Existing approaches may be used, for example a GMM of 1024 or more Gaussians; its parameter training process is shown in Fig. 2.
Step 201: extract voiceprint features from the training speech signals of many speakers, each voiceprint feature serving as one feature vector.
Step 202: cluster the feature vectors with a clustering algorithm to obtain the initial means of K Gaussians, K being the preset number of Gaussians.
For example, the classical LBG (Linde-Buzo-Gray) clustering algorithm may be used, approaching an optimal regenerated codebook through the training vector set and an iterative procedure.
Step 203: iteratively update the means, variances, and weights of the Gaussians with the EM (Expectation Maximization) algorithm to obtain the background model.
The specific iterative update process is the same as in the prior art and is not detailed here.
Of course, the background model may also be built in other ways; the embodiments of the present invention place no restriction on this.
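For illustration, the sketch below trains such a background model with scikit-learn's GaussianMixture. Its k-means initialization stands in for the LBG clustering of step 202 (a deliberate substitution), and the EM iterations of step 203 run inside fit; the diagonal covariance type is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_speaker_features, n_components=1024):
    """Steps 201-203 on pooled multi-speaker features: cluster-based
    initialization followed by EM updates inside GaussianMixture.fit."""
    X = np.vstack(all_speaker_features)            # pool frames from many speakers
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",  # diagonal covariances (assumption)
                          init_params="kmeans",    # stand-in for LBG clustering
                          max_iter=100)
    return ubm.fit(X)
```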
In the embodiments of the present invention, the system must distinguish whether the user is in login mode or registration mode. In login mode, the user is authenticated against the voiceprint password following the flow of Fig. 1; in registration mode, the enrollment speech signal recorded by the user is received and the user's speaker model is built from it.
The speaker model construction process in the embodiments of the present invention differs completely from the traditional one. To make the distinction clear, the traditional construction process is first described briefly below.
The traditional process takes the background model as the initial model and adjusts part of its parameters with an adaptive method, such as the widely used MAP adaptation. From a small amount of speaker data, the adaptive algorithm adapts the shared voiceprint characteristics toward the individual characteristics of the current speaker. Its training flow, shown in Fig. 3, comprises the following steps:
Step 301: extract voiceprint features from the enrollment speech signal recorded by the user.
Step 302: adaptively update the Gaussian means \mu_m of the background model with the voiceprint features.
Specifically, the new Gaussian mean \hat{\mu}_m is computed as the weighted average of the sample statistics and the original Gaussian mean, that is:

\hat{\mu}_m = \frac{\sum_{t=1}^{T} \gamma_m(x_t) \, x_t + \tau \mu_m}{\sum_{t=1}^{T} \gamma_m(x_t) + \tau}    (4)

where x_t denotes the voiceprint feature of frame t, \gamma_m(x_t) denotes the probability that the frame-t feature falls on the m-th Gaussian, and \tau is a forgetting factor that balances the update strength of the historical mean and the samples on the new mean. In general, the larger \tau is, the more the new mean is constrained by the original mean; the smaller \tau is, the more the new mean is determined by the sample statistics and reflects the distribution of the new samples.
Step 303: copy the background model variances as the variances of the user's speaker model.
Step 304: generate the user's speaker model.
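A numpy sketch of the MAP mean update of equation (4) follows; the value of the forgetting factor is an assumption (tau = 16 is a common default in GMM-UBM systems).

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(X, weights, means, covs, tau=16.0):
    """Equation (4): each new mean is a weighted average of the sample
    statistic and the original background mean."""
    # gamma[t, m]: posterior probability that frame t falls on Gaussian m
    comp = np.array([w * multivariate_normal.pdf(X, mean=mu, cov=cov)
                     for w, mu, cov in zip(weights, means, covs)]).T
    gamma = comp / comp.sum(axis=1, keepdims=True)
    n_m = gamma.sum(axis=0)          # sum_t gamma_m(x_t)
    ex_m = gamma.T @ X               # sum_t gamma_m(x_t) * x_t
    return (ex_m + tau * np.asarray(means)) / (n_m + tau)[:, None]
```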
In the embodiments of the present invention, the enrollment speech signal recorded by the user at registration is received, and the user's speaker model is built from it. The speaker model consists of multiple GMMs so as to model the pronunciation variations that occur when the speaker utters the same password, and the variance of each GMM in the speaker model is trained separately. This solves the problem in conventional methods where copying the background model variances directly makes the variances too large to suit practical application.
As shown in Fig. 4, building a speaker model in an embodiment of the present invention comprises the following steps:
Step 401: save each enrollment speech signal recorded by the user as one speech sequence.
Suppose the user speaks the same password content N times at registration (e.g. N = 2, 3, ...), yielding N independent speech sequences.
Step 402: extract voiceprint features from the speech sequences obtained.
The process is similar to step 102 above and is not detailed here.
Step 403: determine all the GMMs of the user's speaker model according to the number of repetitions and the frame counts of the enrollment speech signal.
In the voiceprint password application, the user speaks a unified text content as the password. For example, the number of GMMs in the user's speaker model may be set equal to the number of repetitions of the enrollment speech signal, and the number of Gaussians of each GMM may be set equal to the frame count of the corresponding enrollment speech signal, which can be expressed as:

p(O|M_k) = \sum_{m=1}^{T(k)} c_m^k \, N(O; \mu_m^k, \Sigma_m^k)    (5)

where T(k) is the number of Gaussians of GMM M_k, equal to the frame count of the k-th speech sample corresponding to the model, and c_m^k, \mu_m^k, and \Sigma_m^k are respectively the weight, mean, and variance of the m-th Gaussian component of M_k.
Of course, the embodiments of the present invention do not restrict the topology of the speaker model: the number of GMMs and the number of Gaussians per GMM need not exactly equal the number of repetitions and the frame counts of the enrollment speech signal. A clustering algorithm may be used to choose a number of GMMs smaller than the number of repetitions of the enrollment speech signal, and likewise the number of Gaussians of each GMM may be smaller than the frame count of the enrollment speech signal.
Step 404: estimate the Gaussian mean parameters of all GMMs from the extracted voiceprint features.
In the embodiments of the present invention, the Gaussian mean parameters of the GMM corresponding to a single training sample are determined by that sample. Specifically, each Gaussian mean vector of the GMM may be set to the feature vector of the sample:

\mu_m^k = O_m^k

where \mu_m^k denotes the mean of the m-th Gaussian of the k-th GMM, and O_m^k denotes the voiceprint feature vector of the m-th frame of the k-th speech signal.
Step 405: estimate the Gaussian variance parameters of all GMMs from the extracted voiceprint features.
It may be assumed that the Gaussians within each GMM of the speaker model share one global, unified covariance matrix, so that variance re-estimation is feasible on little data. Under this assumption,

\Sigma_m^k = \Sigma_k, \quad m = 1, \ldots, T(k)

i.e., the covariance matrices of all Gaussian components of the k-th GMM have identical values. Specifically, for a given sample voiceprint feature sequence O^k, the variance of GMM M_k is re-estimated from the statistics of all the remaining sample voiceprint feature sequences O^n (n \neq k), computed as follows:

\Sigma_k = \frac{\sum_{n \neq k} \sum_{i=1}^{T(n)} \sum_{m=1}^{T(k)} \gamma_m^k(O_i^n) \, (O_i^n - \mu_m^k)(O_i^n - \mu_m^k)^T}{\sum_{n \neq k} \sum_{i=1}^{T(n)} \sum_{m=1}^{T(k)} \gamma_m^k(O_i^n)}    (6)

where O_i^n denotes the i-th speech frame (sample) of the n-th spoken password (enrollment speech signal), \mu_m^k denotes the m-th Gaussian mean of the k-th GMM, and \gamma_m^k(O_i^n) denotes the probability that sample O_i^n falls on the Gaussian with mean \mu_m^k.
In this way, for each individual GMM M_k of the speaker model, the corresponding variance parameters are obtained from the sample data other than O^k. With N enrollment speech signals, N different variance matrices are obtained.
In particular, the variance matrix may be assumed to be diagonal to further reduce the data sparsity problem. Furthermore, the Gaussians of all the GMMs of the speaker model may be assumed to share one global, unified diagonal covariance matrix, which better addresses the model variance re-estimation problem under data sparsity.
Step 406: estimate the Gaussian weight parameters of all GMMs.
In this embodiment the Gaussian means of each GMM are determined directly by the sample vectors, so each Gaussian occurs on its sample with probability 1 and all Gaussians occur with identical probability. The weights of the Gaussians within a GMM may therefore be set equal, that is:

c_m^k = c^k = \frac{1}{T(k)}    (7)
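Pulling steps 403 to 406 together, here is a hedged sketch of the whole construction. One deliberate simplification is labeled in the comments: frames of the other utterances are hard-assigned to their nearest Gaussian mean instead of using the soft posteriors \gamma of equation (6), and the covariance is kept diagonal as the text above permits.

```python
import numpy as np

def build_speaker_model(utterances):
    """Steps 403-406: one GMM per enrollment repetition; Gaussian means set
    to the frame vectors (step 404); a tied diagonal covariance per GMM
    estimated from the OTHER utterances (step 405); uniform weights 1/T(k)
    (step 406). Requires N >= 2 utterances. Simplifying assumption: frames
    are hard-assigned to nearest means rather than weighted by the soft
    posteriors gamma of equation (6)."""
    models = []
    for k, O_k in enumerate(utterances):
        means = O_k.copy()                                  # mu_m^k = O_m^k
        others = np.vstack([O for n, O in enumerate(utterances) if n != k])
        d2 = ((others[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        nearest = d2.argmin(axis=1)                         # hard gamma_m^k
        resid = others - means[nearest]                     # O_i^n - mu_m^k
        var = (resid ** 2).mean(axis=0) + 1e-6              # tied diagonal Sigma_k
        models.append({"weights": np.full(len(means), 1.0 / len(means)),
                       "means": means,
                       "covs": [np.diag(var)] * len(means)})
    return models
```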
With the flow of Fig. 4, the number of GMMs in the speaker model and the model topology are set according to the number of enrollment utterances and their lengths. The reasonable settings of the Gaussian means, variances, and weights of all GMMs effectively solve the sparse-data training problem of traditional voiceprint-password authentication systems and improve the discrimination between GMMs, which in turn improves authentication accuracy. Moreover, the model uses fewer Gaussian components while being more effective, which compared with the prior art greatly improves computation speed and reduces the memory needed to store model data.
Correspondingly, an embodiment of the present invention also provides an identity authentication system; Fig. 5 is a schematic structural diagram of it.
In this embodiment, the system comprises:
a speech signal receiving unit 501 for receiving, when a user logs in, the continuous speech signal recorded by the current user;
an extraction unit 502 for extracting the voiceprint feature sequence from the continuous speech signal;
a first computing unit 503 for computing the likelihood of the voiceprint feature sequence against the background model;
a second computing unit 504 for computing the likelihood of the voiceprint feature sequence against the speaker model of the current user, the speaker model being a multi-GMM built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the current user registered;
a third computing unit 505 for computing the likelihood ratio from the likelihood of the voiceprint feature sequence against the speaker model and its likelihood against the background model; and
a judging unit 506 for determining that the current user is a validly authenticated user when the likelihood ratio computed by the third computing unit 505 is greater than the preset threshold, and otherwise determining that the current user is an unauthenticated user.
The voiceprint feature sequence comprises a group of voiceprint features that can effectively distinguish different speakers while remaining relatively stable across variations of the same speaker.
For example, the features extracted by the extraction unit 502 may include spectral envelope features, pitch contour, formant frequencies and bandwidths, linear prediction coefficients, cepstral coefficients, and so on. Considering the quantifiability of these features, the amount of training data, and the evaluation of system performance, MFCC (Mel Frequency Cepstral Coefficient) features may be selected: each frame of speech, with a 25 ms window shifted by 10 ms, is subjected to short-time analysis to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total. Each speech signal is thereby quantized into a 39-dimensional voiceprint feature sequence X.
The background model may be built in advance by the system and loaded at initialization; the embodiments of the present invention place no restriction on its construction process.
The speaker model is a multi-GMM built from the speech signals recorded when the current user registered. Correspondingly, in the embodiments of the present invention the second computing unit 504 may have several implementations, for example:
In one implementation, the second computing unit 504 comprises a first computing subunit and a first determining subunit, wherein:
the first computing subunit computes the likelihood of the voiceprint feature sequence against each GMM;
the first determining subunit determines the likelihood of the voiceprint feature sequence against the speaker model of the current user from the results of the first computing subunit.
The first computing subunit may comprise a first computing module and a first selecting module, wherein:
the first computing module computes the likelihood of each voiceprint feature of the sequence against each GMM of the multi-GMM;
the first selecting module takes the time-averaged sum of the likelihoods of the group of voiceprint features against one GMM as the likelihood of the sequence against that GMM.
Correspondingly, the first determining subunit may also have several implementations. For example, after the first computing subunit obtains the likelihood of the sequence against each GMM, the first determining subunit may select the maximum or the mean of these values as the likelihood of the sequence against the speaker model of the current user.
In another implementation, the second computing unit 504 comprises a second computing subunit and a second determining subunit, wherein:
the second computing subunit computes the likelihood of each voiceprint feature of the sequence against the multi-GMM;
the second determining subunit determines the likelihood of the voiceprint feature sequence against the speaker model of the current user from the results of the second computing subunit.
The second computing subunit may comprise a second computing module and a second selecting module, wherein:
the second computing module computes the likelihood of each voiceprint feature of the sequence against each GMM of the multi-GMM;
the second selecting module takes, for each voiceprint feature of the sequence, the maximum of its likelihoods against the GMMs of the multi-GMM as that feature's likelihood against the multi-GMM; alternatively, it takes the mean of those likelihoods as that feature's likelihood against the multi-GMM.
Correspondingly, the second determining subunit may also have several implementations. For example, after the second computing subunit obtains the likelihood of each voiceprint feature against the multi-GMM, the second determining subunit may take the time average of these likelihoods as the likelihood of the sequence against the speaker model of the current user.
Of course, the second computing unit 504 may also be implemented in other ways; the embodiments of the present invention place no restriction on this.
For the specific computation processes of the first computing unit 503, the second computing unit 504, and the third computing unit 505, refer to the description of the identity authentication method of the embodiments above; they are not repeated here.
In the embodiments of the present invention, the speaker model is a multi-GMM built from the speech signals recorded at registration; the number of GMMs and the number of Gaussians in each GMM are tied to the number of repetitions of the enrollment speech and its frame counts. The multiple GMMs can thus model the pronunciation variations that occur when the user utters the same password (the speech signal above), improving the accuracy of voiceprint-password authentication.
Fig. 6 is another schematic structural diagram of the identity authentication system of an embodiment of the present invention.
Unlike the embodiment of Fig. 5, in this embodiment the speech signal receiving unit 501 also receives, when a user registers, the enrollment speech signal recorded by the user.
In addition, the system further comprises a model construction unit 601 for building the user's speaker model from the enrollment speech signal. The model construction unit 601 comprises:
a feature extraction subunit 611 for extracting voiceprint features from the enrollment speech signal;
a topology determining subunit 612 for determining all the GMMs of the user's speaker model according to the number of repetitions and the frame counts of the enrollment speech signal;
for example, the number of GMMs of the user's speaker model may be set less than or equal to the number of repetitions of the enrollment speech signal, and the number of Gaussians of each GMM may be set less than or equal to the frame count of the enrollment speech signal;
a first estimation subunit 613 for estimating, from the voiceprint features extracted by the feature extraction subunit 611, the Gaussian mean parameters of all the GMMs determined by the topology determining subunit 612; and
a second estimation subunit 614 for estimating, from the voiceprint features extracted by the feature extraction subunit 611, the Gaussian variance parameters of all the GMMs determined by the topology determining subunit 612.
For the methods by which these estimation subunits estimate the GMM parameters, refer to the description above; they are not repeated here.
The identity authentication system of the embodiments of the present invention can set the number of GMMs in the speaker model and determine the model topology according to the number of enrollment utterances and their lengths. The reasonable settings of the Gaussian means, variances, and weights of all GMMs effectively solve the sparse-data training problem of traditional voiceprint-password authentication systems and improve the discrimination between GMMs, which in turn improves authentication accuracy. Moreover, the model uses fewer Gaussian components while being more effective, which compared with the prior art greatly improves computation speed and reduces the memory needed to store model data.
The embodiments in this specification are described progressively; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The system embodiments described above are only schematic: units and modules described as separate components may or may not be physically separate, and some or all of the units and modules may be selected according to actual needs to achieve the purpose of the embodiments' solutions. Those of ordinary skill in the art can understand and implement them without creative effort.
The above discloses only preferred embodiments of the present invention, but the present invention is not limited thereto. Any non-creative variations that those skilled in the art can conceive, and any improvements and modifications made without departing from the principles of the present invention, shall fall within the protection scope of the present invention.

Claims (18)

1. An identity authentication method, characterized by comprising:
when a user logs in, receiving a continuous speech signal recorded by the current user;
extracting a voiceprint feature sequence from the continuous speech signal, the voiceprint feature sequence comprising a group of voiceprint features;
computing the likelihood of the voiceprint feature sequence against a background model;
computing the likelihood of the voiceprint feature sequence against the speaker model of the current user, the speaker model being a multi-GMM (a model composed of multiple Gaussian mixture models) built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the current user registered;
computing the likelihood ratio p = p(X|U) / p(X|UBM) from the likelihood p(X|U) of the voiceprint feature sequence against the speaker model and the likelihood p(X|UBM) of the voiceprint feature sequence against the background model; and
if the likelihood ratio is greater than a preset threshold, determining that the current user is a validly authenticated user; otherwise, determining that the current user is an unauthenticated user.
2. the method for claim 1, is characterized in that, the likelihood score of the speaker model of the described vocal print characteristic sequence of described calculating and described current login user comprises:
Calculate respectively the likelihood score of described vocal print characteristic sequence and each mixed Gauss model;
Determine the likelihood score of the speaker model of described vocal print characteristic sequence and described current login user according to result of calculation.
3. The method of claim 2, characterized in that computing the likelihood of the voiceprint feature sequence against each GMM comprises:
computing the likelihood of each voiceprint feature of the sequence against each GMM of the multi-GMM; and
taking the time-averaged sum of the likelihoods of the group of voiceprint features against one GMM as the likelihood of the sequence against that GMM.
4. The method of claim 2, characterized in that determining the likelihood of the voiceprint feature sequence against the speaker model of the current user from the computed results comprises:
taking the mean of the likelihoods of the sequence against all the GMMs as the likelihood of the sequence against the speaker model of the current user; or
taking the maximum of the likelihoods of the sequence against all the GMMs as the likelihood of the sequence against the speaker model of the current user.
5. the method for claim 1, is characterized in that, the likelihood score of the speaker model of the described vocal print characteristic sequence of described calculating and described current login user comprises:
Calculate respectively in described vocal print characteristic sequence each vocal print feature with respect to the likelihood score of described polyhybird Gauss model;
Determine the likelihood score of the speaker model of described vocal print characteristic sequence and described current login user according to result of calculation.
6. The method of claim 5, characterized in that computing the likelihood of each voiceprint feature of the sequence against the multi-GMM comprises:
computing the likelihood of each voiceprint feature of the sequence against each GMM of the multi-GMM; and
taking, for each voiceprint feature of the sequence, the maximum of its likelihoods against the GMMs of the multi-GMM as that feature's likelihood against the multi-GMM; or taking the mean of those likelihoods as that feature's likelihood against the multi-GMM.
7. The method of claim 5, characterized in that determining the likelihood of the voiceprint feature sequence against the speaker model of the current user from the computed results comprises:
taking the time average of the likelihoods of all voiceprint features of the sequence against the multi-GMM as the likelihood of the sequence against the speaker model of the current user.
8. The method of any one of claims 1 to 7, characterized in that the method further comprises:
when a user registers, receiving the enrollment speech signal recorded by the user; and
building the user's speaker model from the enrollment speech signal;
wherein building the user's speaker model from the enrollment speech signal comprises:
extracting voiceprint features from the enrollment speech signal;
determining all the GMMs of the user's speaker model according to the number of repetitions and the frame counts of the enrollment speech signal;
estimating the Gaussian mean parameters of all the GMMs of the user's speaker model from the voiceprint features extracted from the enrollment speech signal; and
estimating the Gaussian variance parameters of all the GMMs of the user's speaker model from the voiceprint features extracted from the enrollment speech signal.
9. The method of claim 8, characterized in that determining all the GMMs of the user's speaker model according to the number of repetitions and the frame counts of the enrollment speech signal comprises:
setting the number of GMMs of the user's speaker model to be less than or equal to the number of repetitions of the enrollment speech signal; and
setting the number of Gaussians of each GMM to be less than or equal to the frame count of the enrollment speech signal corresponding to that GMM.
10. An identity authentication system, characterized by comprising:
a speech signal receiving unit for receiving, when a user logs in, a continuous speech signal recorded by the current user;
an extraction unit for extracting a voiceprint feature sequence, comprising a group of voiceprint features, from the continuous speech signal;
a first computing unit for computing the likelihood of the voiceprint feature sequence against a background model;
a second computing unit for computing the likelihood of the voiceprint feature sequence against the speaker model of the current user, the speaker model being a multi-GMM built according to the number of repetitions and the frame counts of the enrollment speech signals recorded when the current user registered;
a third computing unit for computing the likelihood ratio p = p(X|U) / p(X|UBM) from the likelihood p(X|U) of the voiceprint feature sequence against the speaker model and the likelihood p(X|UBM) of the voiceprint feature sequence against the background model; and
a judging unit for determining that the current user is a validly authenticated user when the likelihood ratio computed by the third computing unit is greater than a preset threshold, and otherwise determining that the current user is an unauthenticated user.
11. The system of claim 10, characterized in that the second computing unit comprises:
a first computing subunit for computing the likelihood of the voiceprint feature sequence against each GMM; and
a first determining subunit for determining the likelihood of the voiceprint feature sequence against the speaker model of the current user from the results of the first computing subunit.
12. The system of claim 11, characterized in that the first computing subunit comprises:
a first computing module for computing the likelihood of each voiceprint feature of the sequence against each GMM of the multi-GMM; and
a first selecting module for taking the time-averaged sum of the likelihoods of the group of voiceprint features against one GMM as the likelihood of the sequence against that GMM.
13. The system of claim 11, characterized in that:
the first determining subunit is specifically configured to take the mean of the likelihoods of the sequence against all the GMMs as the likelihood of the sequence against the speaker model of the current user; or to take the maximum of those likelihoods as the likelihood of the sequence against the speaker model of the current user.
14. system as claimed in claim 10, is characterized in that, described the second computing unit comprises:
The second computation subunit, for calculating respectively the likelihood score of each vocal print feature of described vocal print characteristic sequence with respect to described polyhybird Gauss model;
Second determines subelement, determines the likelihood score of the speaker model of described vocal print characteristic sequence and described current login user for the result of calculation according to described the second computation subunit.
15. system as claimed in claim 14, is characterized in that, described the second computation subunit comprises:
The second computing module, for calculating respectively the likelihood score of each mixed Gauss model in each vocal print feature of described vocal print characteristic sequence and described polyhybird Gauss model;
Second selects module, for selecting maximum in likelihood score that in the corresponding described polyhybird Gauss model of vocal print feature of described vocal print characteristic sequence, each mixed Gauss model the calculates likelihood score as this vocal print feature and described polyhybird Gauss model; Perhaps select in described vocal print characteristic sequence the likelihood score of the mean value of all likelihood scores that in the corresponding described polyhybird Gauss model of a vocal print feature, each mixed Gauss model calculates as this vocal print feature and described polyhybird Gauss model.
16. The system according to claim 14, wherein
the second determination subunit is specifically configured to take the time average of the likelihoods of the voiceprint features in the voiceprint characteristic sequence with respect to the polyhybrid Gaussian model as the likelihood between the voiceprint characteristic sequence and the speaker model of the current login user.
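Claims 14 to 16 describe the alternative scoring path: score each voiceprint feature (frame) against every mixed Gaussian model in the polyhybrid Gaussian model, keep the per-frame maximum or mean across models (claim 15), then time-average those frame scores (claim 16). A sketch reusing the gmm_frame_loglik helper from the previous snippet, under the same assumptions:

```python
import numpy as np

def frame_scores(frames, gmms, select="max"):
    """Claim 15: per-frame likelihood with respect to the polyhybrid Gaussian model."""
    per_model = np.stack([gmm_frame_loglik(frames, *g) for g in gmms])  # (K, T)
    return per_model.max(axis=0) if select == "max" else per_model.mean(axis=0)

def speaker_model_score_framewise(frames, gmms, select="max"):
    """Claim 16: time average of the per-frame scores."""
    return frame_scores(frames, gmms, select).mean()
```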
17. The system according to any one of claims 10 to 16, wherein
the voice signal receiving unit is further configured to receive, when a user registers, the registration voice signal recorded by the user; and
the system further comprises a model construction unit configured to build the user's speaker model from the registration voice signal, the model construction unit comprising:
a feature extraction subunit, configured to extract voiceprint features from the registration voice signal;
a topology determination subunit, configured to determine all of the mixed Gaussian models of the user's speaker model according to the repetition count and the frame number of the registration voice signal;
a first estimation subunit, configured to estimate the Gaussian mean parameters of all of the mixed Gaussian models determined by the topology determination subunit, using the voiceprint features extracted by the feature extraction subunit;
a second estimation subunit, configured to estimate the Gaussian variance parameters of all of the mixed Gaussian models determined by the topology determination subunit, using the voiceprint features extracted by the feature extraction subunit.
18. The system according to claim 17, wherein
the topology determination subunit is specifically configured to set the number of mixed Gaussian models in the user's speaker model to be less than or equal to the repetition count of the registration voice signal, and to set the number of Gaussians in each mixed Gaussian model to be less than or equal to the frame number of the registration voice signal corresponding to that mixed Gaussian model.
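Claims 17 and 18 tie the speaker-model topology directly to the enrollment data: at most one mixed Gaussian model per repetition of the registration password, and at most one Gaussian per frame within each repetition, with means and variances then estimated from the extracted voiceprint features. A sketch of that topology rule with a crude moment-based parameter estimate (the patent does not fix the estimation algorithm; the time-ordered grouping below stands in for a proper EM or k-means step):

```python
import numpy as np

def build_speaker_model(repetitions, max_gaussians=32):
    """One mixed Gaussian model per recorded repetition (claim 18), with the
    Gaussian count of each model capped by that repetition's frame count.
    repetitions: list of (T_i, D) arrays of voiceprint features."""
    model = []
    for frames in repetitions:
        n_frames, _ = frames.shape
        m = min(max_gaussians, n_frames)    # Gaussians <= frame count
        groups = np.array_split(frames, m)  # crude stand-in for EM clustering
        weights = np.array([len(g) for g in groups], dtype=float) / n_frames
        means = np.array([g.mean(axis=0) for g in groups])             # mean parameters
        variances = np.array([g.var(axis=0) + 1e-3 for g in groups])   # variance parameters, floored
        model.append((weights, means, variances))
    return model  # number of models == number of repetitions
```

A model built this way plugs directly into the scoring sketches above, since each element is a (weights, means, variances) triple.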
CN2011102180452A 2011-08-01 2011-08-01 Identity authentication method and system Active CN102238190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102180452A CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system

Publications (2)

Publication Number Publication Date
CN102238190A CN102238190A (en) 2011-11-09
CN102238190B true CN102238190B (en) 2013-12-11

Family

ID=44888395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102180452A Active CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system

Country Status (1)

Country Link
CN (1) CN102238190B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102510426A (en) * 2011-11-29 2012-06-20 安徽科大讯飞信息科技股份有限公司 Personal assistant application access method and system
CN102496365A (en) * 2011-11-30 2012-06-13 上海博泰悦臻电子设备制造有限公司 User verification method and device
US8818810B2 (en) * 2011-12-29 2014-08-26 Robert Bosch Gmbh Speaker verification in a health monitoring system
CN102710602B * 2012-04-28 2016-04-13 深圳创维-Rgb电子有限公司 Voice login method and system for electronic equipment, and television set
CN102968990B (en) * 2012-11-15 2015-04-15 朱东来 Speaker identifying method and system
CN103226951B (en) * 2013-04-19 2015-05-06 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN105096954A (en) * 2014-05-06 2015-11-25 中兴通讯股份有限公司 Identity identifying method and device
CN104239471B * 2014-09-03 2017-12-19 陈飞 Device and method for performing data query/exchange by way of behavior simulation
CN104361891A (en) * 2014-11-17 2015-02-18 科大讯飞股份有限公司 Method and system for automatically checking customized polyphonic ringtones of specific population
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN106057206B * 2016-06-01 2019-05-03 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method, and device
CN106157135A * 2016-07-14 2016-11-23 微额速达(上海)金融信息服务有限公司 Anti-fraud system and method based on voiceprint recognition of gender and age
CN106228990A * 2016-07-15 2016-12-14 北京光年无限科技有限公司 Login method and operating system for intelligent robots
CN107705791B (en) * 2016-08-08 2021-06-04 中国电信股份有限公司 Incoming call identity confirmation method and device based on voiceprint recognition and voiceprint recognition system
CN107767863B * 2016-08-22 2021-05-04 科大讯飞股份有限公司 Voice wake-up method and system, and intelligent terminal
CN107068154A * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 Identity authentication method and system based on voiceprint recognition
CN109102810B (en) * 2017-06-21 2021-10-15 北京搜狗科技发展有限公司 Voiceprint recognition method and device
CN110223078A (en) * 2019-06-17 2019-09-10 国网电子商务有限公司 Identity authentication method, device, electronic equipment and storage medium
CN111023470A (en) * 2019-12-06 2020-04-17 厦门快商通科技股份有限公司 Air conditioner temperature adjusting method, medium, equipment and device
CN115171727A (en) * 2022-09-08 2022-10-11 北京亮亮视野科技有限公司 Method and device for quantifying communication efficiency

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833951A (en) * 2010-03-04 2010-09-15 清华大学 Multi-background modeling method for speaker recognition
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7447633B2 (en) * 2004-11-22 2008-11-04 International Business Machines Corporation Method and apparatus for training a text independent speaker recognition system using speech data with text labels
CN101136199B (en) * 2006-08-30 2011-09-07 纽昂斯通讯公司 Voice data processing method and equipment

Similar Documents

Publication Publication Date Title
CN102238190B (en) Identity authentication method and system
CN102238189B (en) Voiceprint password authentication method and system
CN107610707B Voiceprint recognition method and device
CN102509547B Method and system for voiceprint recognition based on vector quantization
JP6303971B2 (en) Speaker change detection device, speaker change detection method, and computer program for speaker change detection
Chavan et al. An overview of speech recognition using HMM
CN101241699B (en) A speaker identification method for remote Chinese teaching
EP1989701B1 (en) Speaker authentication
CN104900235B Voiceprint recognition method based on pitch-period composite characteristic parameters
CN108417201B (en) Single-channel multi-speaker identity recognition method and system
CN107221320A Method, device, equipment and computer storage medium for training an acoustic feature extraction model
CN1963917A Method and apparatus for evaluating voice distinguishability, registering and verifying speaker authentication
CN105261246B Spoken English error-correction system based on big-data mining technology
US20080065380A1 (en) On-line speaker recognition method and apparatus thereof
CN102324232A Voiceprint recognition method and system based on Gaussian mixture models
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
CN104835498A (en) Voiceprint identification method based on multi-type combination characteristic parameters
CN102270451A (en) Method and system for identifying speaker
CN101923855A Text-independent voiceprint identification system
CN104765996A (en) Voiceprint authentication method and system
CN102184654A (en) Reading supervision method and device
CN106297769B Discriminative feature extraction method applied to language identification
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
CN104901807A Voiceprint password method usable on low-end chips
Kockmann et al. Recent progress in prosodic speaker verification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: IFLYTEK CO., LTD.

Free format text: FORMER NAME: ANHUI USTC IFLYTEK CO., LTD.

CP03 Change of name, title or address

Address after: No. 666 Wangjiang West Road, Hefei High-tech Development Zone, Anhui, China, 230088

Patentee after: Iflytek Co., Ltd.

Address before: No. 616 Huangshan Road, High-tech Development Zone, Hefei, Anhui, China, 230088

Patentee before: Anhui USTC iFLYTEK Co., Ltd.