A voiceprint recognition method
Technical field
The invention belongs to the field of computer and information service technology, and in particular to a method for identifying and verifying a person's identity by means of his or her voiceprint.
Background technology
Voiceprint recognition, also called speaker recognition, identifies who spoke a given segment of speech from the biometric characteristics carried in the speaker's voice — the proverbial "knowing the person by the voice". Voiceprint recognition can be used in almost every security field that requires identity identification or verification, as well as in personalized applications. For example: (1) voiceprint identification: criminal investigation and case solving, suspect tracking, national-defense monitoring, personalized applications, etc.; (2) voiceprint verification: securities trading, bank transactions, forensic evidence collection, voice-controlled locks for PCs and automobiles, identity cards, credit cards, and so on.
As is well known, every person's fingerprint is unique; similarly, every person's voiceprint has a certain uniqueness, and it is difficult to find two people with identical voiceprints. This, in theory, provides the basis for reliable voiceprint recognition. A general voiceprint recognition method comprises two parts, a model training (or learning) process 1 and a voiceprint recognition process 2, as shown in Figure 1. The model training process extracts acoustic feature vectors (also called acoustic features, eigenvectors, or simply features) from the sound waveform — this is feature extraction — and builds from each person's acoustic features an acoustic model, called a voiceprint model, thereby forming a model bank. The voiceprint recognition process then matches the acoustic features extracted from the speech of the person to be identified against the voiceprint models in the model bank, and from this comparison produces a decision.
Voiceprint recognition methods fall into two types: text-dependent and text-independent. The former requires the speaker to utter prescribed content — sentences, phrases, words, or digits agreed upon in advance — during recognition; the latter places no restriction whatsoever on what the speaker says: in both training and recognition, the speaker may say anything in any language. The latter is clearly more difficult, but it is easier to use and has a wider range of applications.
The performance of a voiceprint recognition system depends on several factors, but the quality of the feature extraction and the descriptive power of the acoustic model are two of the most important.
The acoustic features commonly extracted in current voiceprint recognition methods include: (1) linear prediction cepstral coefficients (LPCC); (2) Mel-frequency cepstral coefficients (MFCC); and so on.
Commonly used acoustic modeling methods include the following:
(1) Template matching: a dynamic time warping (DTW) algorithm aligns the training and recognition (test) feature sequences; this is mainly suited to applications with fixed short phrases (usually text-dependent tasks).
(2) Nearest neighbor: all acoustic feature vectors must be kept at training time; at recognition/test time, the K nearest training vectors are found for each input vector, and the decision is made accordingly. With this method, both the model storage and the similarity computation are very costly.
(3) Neural networks: these come in many forms, including the multilayer perceptron and radial basis function (RBF) networks. Through explicit training they enlarge the differences between a speaker's model and the other models, attempting to achieve maximum separability. Their drawbacks are a heavy training burden, slow convergence, and poor model generalization.
(4) Hidden Markov model (HMM): this method assumes that human speech is governed by two processes, a state-transition process and an acoustic-feature-vector output process. It is a good mathematical model of the human speech production mechanism. Usually the output process is described by a mixture-of-Gaussians distribution.
(5) Gaussian mixture model (GMM): a Gaussian mixture model is in effect a single-state hidden Markov model. Suppose the acoustic feature vector sequence is X = {X_1, ..., X_T}; then the log-likelihood score (abbreviated likelihood score, matching score, or simply score) of the observed feature sequence with respect to a speaker model M, which must be computed at recognition time, is calculated as

S(X|M) = Σ_{t=1}^{T} log P(X_t|M),

where P(X_t|M) is the mixture-of-Gaussians density of frame X_t under model M.
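As an illustration of this scoring rule (not part of the patent; the function names and the diagonal-covariance assumption are mine), the following Python sketch computes the GMM log-likelihood of a feature sequence:

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(X, weights, means, variances):
    """S(X|M) = sum_t log sum_k w_k N(X_t; mu_k, Sigma_k),
    computed stably with the log-sum-exp trick."""
    total = 0.0
    for x in X:
        comps = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total
```

Note that the score sums over all K Gaussians per frame; the invention described below departs from this by keeping only the best-matching Gaussian.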
Among the common methods above, the hidden Markov model and Gaussian mixture model methods perform best. But even their overall performance is unsatisfactory, and they cannot achieve optimal results in text-independent voiceprint recognition. In addition, these methods usually require long stretches of speech to identify or verify a speaker accurately.
Voiceprint recognition comprises two types: voiceprint identification and voiceprint verification. Voiceprint verification determines whether a given segment of speech was really spoken by a specific claimed person, a 1-to-1 decision problem. This type is shown in Fig. 2(a); its steps are: take the feature vector sequence of the speech to be verified, after front-end processing, compute its matching score against the claimed speaker's model, subtract its matching score against the background model corresponding to the claimed speaker, and obtain the result Λ; then compare Λ with a preset threshold θ. If Λ > θ, the claim is accepted, i.e. the speech is deemed to have been spoken by the claimed speaker; if Λ < θ, the claim is rejected, i.e. the speech is deemed not to have been spoken by the claimed speaker. Rejection here means refusing a wrong result, so voiceprint verification is precisely a voiceprint rejection decision.
Voiceprint identification determines which of several people spoke a given segment of speech, an N-to-1 selection problem; it is further divided into closed-set and open-set cases. Closed-set voiceprint identification, shown in Fig. 2(b), matches the feature vector sequence of the speech to be identified, after front-end processing, one by one against all the speaker models in the model bank, obtains the maximum (MAX) matching score S and the corresponding speaker index, and concludes that the speech was spoken by the speaker with the maximum matching score; closed-set identification does not check whether the speaker of the speech is really in the voiceprint model bank. Open-set voiceprint identification, after closed-set identification has produced a speaker from the voiceprint model bank, further uses the voiceprint verification decision to accept or reject this identification result.
In practical applications, voiceprint verification and open-set voiceprint identification are in greater demand than closed-set identification, and in both of these applications the rejection problem is crucial. For rejection, a background model, also called an impostor model, is usually needed. A background model can be built in two ways: first, each speaker M has one or a group of corresponding background models Bkg(M); second, a speaker-independent universal background model (UBM) is used, i.e. for any speaker M the background model is Bkg(M) = UBM. On this basis, given a feature sequence X = {X_1, ..., X_T}, its likelihood score Λ(X|M) with respect to speaker M can be obtained as:

Λ(X|M) = log P(X|M) − log P(X|Bkg(M)),

where P(X|M) is computed with the standard mixture-of-Gaussians density formula. Then, from the relation between the likelihood score Λ(X|M) and a preset threshold θ, the speech can be determined to be the voice of speaker M (Λ(X|M) > θ) or not the voice of speaker M (Λ(X|M) < θ). Clearly, the setting of the threshold θ is crucial for rejection, and because it is usually fixed in advance, it sometimes cannot adapt to the requirements of practical applications.
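The threshold test above can be sketched in a few lines of Python (names are mine; the two log-probabilities would come from the speaker model and its background model respectively):

```python
def verify_speaker(log_p_model, log_p_background, theta):
    """Lambda = log P(X|M) - log P(X|Bkg(M)); accept iff Lambda > theta.
    Returns the decision and the likelihood score Lambda."""
    lam = log_p_model - log_p_background
    return lam > theta, lam
```

Raising θ trades false acceptances for false rejections, which is why a fixed, hand-set θ is the weak point this invention addresses.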
The main problem of existing rejection methods is that the rejection threshold is fixed, which makes it difficult to set the threshold and to deploy the voiceprint recognition system under different hardware and software environments.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by proposing a new voiceprint recognition method. By adopting a series of techniques — vector quantization clustering, local-maximum template matching, automatic threshold estimation, and multi-stage decision criteria — the invention effectively eliminates the dependence of recognition performance on the text content, effectively eliminates the dependence of recognition performance on the length of the speech, and allows the rejection threshold to be obtained automatically through training.
The invention proposes a voiceprint recognition method comprising two parts, a model training method and a voiceprint recognition method. The steps of the model training method are:
1) extract acoustic features from each speaker's sound waveform to form that speaker's feature vector sequence;
2) build a voiceprint model for each person from that speaker's feature vector sequence, and put the individual voiceprint models together to form a model bank.
The voiceprint recognition method is:
3) extract acoustic features from the speech of the person to be identified to form a feature vector sequence to be identified;
4) match this feature vector sequence one by one against the voiceprint models in the model bank, obtain the matching score (also called the log-likelihood score, likelihood score, or simply score) between the feature vector sequence and each speaker's voiceprint model, and make a decision;
5) according to the type of recognition (closed-set identification, open-set identification, or verification), make a rejection decision when needed, thereby obtaining the result.
The method is characterized in that in said step 2) the voiceprint model for each speaker is built as follows: the speaker's feature vector sequence is clustered with the classical LBG algorithm to obtain a mixture of K Gaussian distributions, where the k-th Gaussian has mean vector μ_k and diagonal covariance matrix Σ_k; denoting by w_k the percentage of feature vectors assigned to the k-th Gaussian during LBG clustering out of the total number of vectors in the full feature vector sequence, the speaker's voiceprint model is M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K};
and in said step 4) the matching score (log-likelihood score) S(X|M) between the feature vector sequence to be identified X = {X_1, ..., X_T} and a speaker's voiceprint model M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K} is obtained with a probability calculation method based on local-maximum template matching, that is:

S(X|M) = (1/T) Σ_{t=1}^{T} max_{1≤k≤K} log[ w_k N(X_t; μ_k, Σ_k) ],    (3)

where N(X_t; μ_k, Σ_k) is the Gaussian density with mean μ_k and diagonal covariance Σ_k: for each frame, only the best-matching Gaussian contributes to the score.
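A minimal sketch of this local-maximum scoring rule in Python (the names and the list-of-triples model layout are my own; each model entry is a (mean, variance, weight) triple with diagonal covariance):

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def local_max_score(X, model):
    """Formula (3): per frame, keep only the best-matching weighted
    Gaussian, then average over the T frames so that speech length
    cancels out of the score."""
    total = 0.0
    for x in X:
        total += max(math.log(w) + log_gauss_diag(x, mean, var)
                     for mean, var, w in model)
    return total / len(X)
```

Averaging over T (rather than summing) keeps scores comparable across utterances of different lengths, in line with the invention's claim of no special requirement on speech length.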
The invention has the following features:
1) recognition performance is independent of the text spoken and of the language used;
2) voiceprint identification can operate in open-set mode, i.e. impostor speakers can be rejected;
3) the rejection threshold for open-set identification can be estimated automatically and reliably in an unsupervised manner;
4) there is no special requirement on speech length; only a few seconds of speech are needed for reliable training and recognition;
5) recognition accuracy is very high: the accuracy of both speaker identification and verification is not lower than 98%, and both the false acceptance rate and the false rejection rate of voiceprint rejection are below 1%;
6) model storage space is small: each speaker's voiceprint model occupies less than 5 KB;
7) the operating-point threshold is easy to adjust: following "accuracy rate + uncertainty rate + error rate = 100%", the operating-point threshold can be adjusted to different application demands, making the final accuracy rate (the first-choice acceptance accuracy) as high as possible, or driving the error rate (false acceptance rate or false rejection rate) as low as possible.
The invention can be used in e-commerce, automated information retrieval, personalized services, and so on, covering security (access control, encrypted credit cards, etc.), finance (automatic bank transfer, inquiry, and cashier services, etc.), national defense (telephone monitoring and tracking, friend-or-foe identification of personnel, etc.), and police and judicial fields (criminal investigation and tracking, evidence collection, identity identification, etc.).
Description of drawings
Fig. 1 is the general framework of an existing voiceprint recognition method.
Fig. 2 shows block diagrams of the two types of existing voiceprint recognition: voiceprint identification and voiceprint verification.
Fig. 3 is the general framework of an embodiment of the voiceprint recognition method of the invention.
Fig. 4 is a block diagram of an embodiment of the rejection training method of the invention.
Embodiment
An embodiment of the voiceprint recognition method proposed by the invention, and its application, is described in detail below with reference to the accompanying drawings.
The embodiment, shown in Fig. 3(a)-(c), comprises a model training method and the two types of voiceprint recognition, voiceprint identification and voiceprint verification, described in turn with reference to the drawings as follows:
The model training method of the present embodiment is shown in Fig. 3(a); its concrete steps are:
1) take one speaker's voice data, analyze the raw speech waveform, and discard each silent segment;
2) with a frame width of 32 milliseconds and a frame shift of half a frame, extract 16-dimensional linear prediction cepstral coefficients (LPCC) from each frame, compute their regression (delta) coefficients, and form a 32-dimensional feature vector; the feature vectors of all frames form the feature vector sequence;
3) build this speaker's voiceprint model: cluster the speaker's feature vector sequence with the classical LBG algorithm to obtain a mixture of K Gaussian distributions, where the k-th Gaussian has mean vector μ_k and diagonal covariance matrix Σ_k; denoting by w_k the percentage of feature vectors assigned to the k-th Gaussian during LBG clustering out of the total number of vectors in the full feature vector sequence, the speaker's voiceprint model is M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K}, which is stored in the voiceprint model bank;
4) if there are speakers not yet trained, return to step 1) to train the next speaker; otherwise the training process ends.
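Step 3) can be sketched as follows — a minimal LBG split-and-refine implementation in Python (my own simplification: plain Euclidean k-means refinement, K assumed a power of two, and a small variance floor), producing the {μ_k, Σ_k, w_k} triples:

```python
def _assign(X, means):
    """Nearest-centroid assignment (squared Euclidean distance)."""
    clusters = [[] for _ in means]
    for x in X:
        j = min(range(len(means)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, means[k])))
        clusters[j].append(x)
    return clusters

def _refine(X, means, iters=20):
    """A few k-means iterations; empty clusters keep their old centroid."""
    for _ in range(iters):
        clusters = _assign(X, means)
        means = [[sum(col) / len(c) for col in zip(*c)] if c else m
                 for c, m in zip(clusters, means)]
    return means

def lbg_train(X, K, eps=1e-3):
    """LBG: start from the global centroid, repeatedly split every
    centroid by +/-eps and refine, until K centroids remain.
    Returns the voiceprint model as (mu_k, var_k, w_k) triples."""
    dim = len(X[0])
    means = [[sum(x[d] for x in X) / len(X) for d in range(dim)]]
    while len(means) < K:
        means = [[m[d] + s * eps for d in range(dim)]
                 for m in means for s in (1.0, -1.0)]
        means = _refine(X, means)
    clusters = _assign(X, means)
    model = []
    for m, c in zip(means, clusters):
        var = [max(sum((x[d] - m[d]) ** 2 for x in c) / max(len(c), 1), 1e-6)
               for d in range(dim)]
        model.append((m, var, len(c) / len(X)))
    return model
```

The weight w_k falls out directly as the fraction of feature vectors that land in cluster k, exactly as the model definition above prescribes.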
The voiceprint identification method of the present embodiment, shown in Fig. 3(b), comprises the following steps:
1) collect the voice data of the speaker to be identified, analyze the raw speech waveform, and discard each silent segment;
2) with the same frame width and frame shift as in voiceprint model training, extract 16-dimensional LPCC from each frame and compute their regression (delta) coefficients to form a 32-dimensional feature vector to be identified; the feature vectors of all frames form the feature vector sequence to be identified X = {X_1, ..., X_T};
3) take one speaker's voiceprint model M from the voiceprint model bank;
4) use the probability calculation method based on local-maximum template matching, i.e. formula (3), to obtain the matching score (log-likelihood score) S(X|M) between the feature vector sequence to be identified X = {X_1, ..., X_T} and the speaker's voiceprint model M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K}, and record it;
5) if some speaker's matching score has not been computed, return to step 3);
6) take the maximum matching score S_max over all speakers' voiceprint models, and the corresponding speaker M_max, as the identification candidate;
7) for closed-set identification, M_max is the identification result; otherwise, take M_max as the claimed speaker, use the universal background model as the background model, and make a rejection decision on the result using the voiceprint verification technique;
8) output the result; the voiceprint identification process ends.
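Steps 3) to 6) reduce to an argmax over the model bank; a sketch (the names are mine, and score_fn stands in for formula (3)):

```python
def identify_closed_set(X, model_bank, score_fn):
    """Match X against every voiceprint model and return the speaker
    with the maximum matching score, plus that score (steps 3-6)."""
    best_id, best_score = None, float("-inf")
    for speaker_id, model in model_bank.items():
        s = score_fn(X, model)
        if s > best_score:
            best_id, best_score = speaker_id, s
    return best_id, best_score
```

For open-set identification, the returned candidate would then pass through the rejection decision described with Fig. 4(d).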
The voiceprint verification method of the present embodiment, shown in Fig. 3(c), comprises the following steps:
1) collect the voice data of the speaker to be verified, analyze the raw speech waveform, and discard each silent segment;
2) with the same frame width and frame shift as in voiceprint model training, extract 16-dimensional LPCC from each frame and compute their regression (delta) coefficients to form 32-dimensional feature vectors; the feature vectors of all frames form the feature vector sequence;
3) take out the claimed speaker's voiceprint model and its background model;
4) make the rejection decision;
5) output the result; the voiceprint verification process ends.
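The front end shared by all three procedures (silence removal, 32 ms frames with a half-frame shift) can be sketched as follows; the energy-based silence test and its floor value are my own simplification, not specified in the patent:

```python
def frame_signal(samples, rate, frame_ms=32, energy_floor=1e-4):
    """Cut the waveform into frame_ms-wide frames with a half-frame
    shift, dropping frames whose mean energy falls below energy_floor
    (treated as silence)."""
    flen = int(rate * frame_ms / 1000)
    shift = flen // 2
    frames = []
    for start in range(0, len(samples) - flen + 1, shift):
        frame = samples[start:start + flen]
        if sum(s * s for s in frame) / flen >= energy_floor:
            frames.append(frame)
    return frames
```

Each kept frame would then be converted to 16 LPCC plus 16 regression coefficients, giving the 32-dimensional feature vector used throughout.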
The rejection decision method embodiment of the invention, shown in Fig. 4(a)-4(d), comprises two parts, rejection training and the rejection decision. The rejection training comprises the following steps:
1) train the background models;
2) train the rejection thresholds;
3) train the voiceprint models, specifically:
(1) take one speaker's voice data and compute its effective feature vector sequence;
(2) train this speaker's voiceprint model;
(3) select Q background models for this speaker;
(4) store this speaker's voiceprint model and the parameters related to the Q background voiceprint models in the voiceprint model bank;
(5) repeat steps (1) to (4) until the voiceprint models of all speakers have been trained.
The embodiment of background model training above, shown in Fig. 4(a), must be performed before the first use of voiceprint recognition. Specifically: collect the voice data of N background speakers and train each background speaker's voiceprint model with the voiceprint model training method, N models in all; these are called the background voiceprint models and are stored in the background voiceprint model bank.
The embodiment of rejection threshold training above, shown in Fig. 4(b), comprises the following steps:
(1) take the n-th background model M_n = {μ_nk, Σ_nk, w_nk | 1 ≤ k ≤ K} and its corresponding feature vector sequence X_n, and compute the matching score S(X_n|M_n) between them with formula (3);
(2) compute by formula (4) the percentage CAP of feature vectors of the speech that fall into the critical regions of the Gaussian distributions, where TSH is a threshold expressing the size of the critical region of the mixture Gaussian density (TSH can usually be taken as 1.0; the smaller the value, the smaller the critical region and the stricter the control);
(3) compute with formula (3) the matching scores of this feature vector sequence X_n against each background model other than M_n, and take the top Q background models in descending order of score, with scores S_{n,1}, ..., S_{n,Q};
(4) repeat steps (1)-(3) until the above values have been computed for all n = 1, ..., N background models;
(5) take the minimum S_TOP(n) value over all background models and multiply it by a coefficient less than 1.0, as the likelihood-score threshold;
(6) take the minimum CAP value over all background models and multiply it by a coefficient less than 1.0, as the CAP threshold;
(7) take the minimum score-difference value over all background models and multiply it by a coefficient less than 1.0, as the likelihood-score-difference threshold;
(8) compute the threshold of the score divergence value by formula (5), where β is a coefficient greater than 1.0.
The coefficients used in the above threshold estimation are not fixed; they may all float with the adjustment of the "operating point" threshold, to satisfy the requirements of a concrete application.
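The minimum-times-coefficient recipe of steps (5)-(8) can be sketched as follows (a simplification under my own assumptions: the background statistics are treated as positive quantities, and the divergence threshold of formula (5), whose exact form is not reproduced here, is approximated by scaling the largest background divergence by β):

```python
def estimate_thresholds(bg_scores, bg_caps, bg_diffs, bg_divs,
                        coef=0.9, beta=1.2):
    """Unsupervised threshold estimation from N background speakers:
    the score, CAP, and score-difference thresholds are the minima of
    the corresponding background statistics, relaxed by a coefficient
    below 1.0 (steps (5)-(7)); the divergence threshold uses a
    coefficient beta above 1.0 (my stand-in for formula (5))."""
    return {
        "score": min(bg_scores) * coef,
        "cap": min(bg_caps) * coef,
        "diff": min(bg_diffs) * coef,
        "div": max(bg_divs) * beta,
    }
```

As the text notes, coef and beta are not fixed: they float with the operating-point adjustment of feature 7) in the summary.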
The embodiment of selecting Q background models above is shown in Fig. 4(c). This selection is used in the voiceprint training performed for a speaker after background model training; it comprises the following steps:
(1) after training the speaker's voiceprint model M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K} from his feature vector sequence X = {X_1, ..., X_T}, compute the matching score S_TOP = S(X|M) of X and M with formula (3);
(2) compute with formula (3) the matching scores of X against the N background models, and select, in descending order, the scores S_I1, ..., S_IQ of the Q background models with the largest matching scores, together with their indices I_1, ..., I_Q;
(3) store S_TOP, S_I1, ..., S_IQ and I_1, ..., I_Q in this speaker's voiceprint model.
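Step (2), the cohort selection, is a sort-and-truncate; a sketch (names are mine, and score_fn again stands in for formula (3)):

```python
def select_backgrounds(X, background_bank, score_fn, Q):
    """Score X against every background model and keep the Q best,
    returning their indices and scores in descending score order."""
    scored = sorted(((score_fn(X, model), idx)
                     for idx, model in background_bank.items()),
                    reverse=True)
    top = scored[:Q]
    return [idx for _, idx in top], [s for s, _ in top]
```

Storing these per-speaker cohort scores alongside the model is what lets the later rejection decision run without re-scanning all N background models.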
The embodiment of the rejection decision above is shown in Fig. 4(d). This decision is used in voiceprint verification or open-set voiceprint identification: a consistency decision is made between the feature vector sequence of the speech to be recognized X = {X_1, ..., X_T} and the target speaker M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K}, where the target speaker M may be the candidate result of voiceprint identification or the claimed speaker in voiceprint verification. It comprises the following steps:
(1) compute with formula (3) the matching likelihood score R_TOP of the feature vector sequence X and the target speaker's voiceprint model M;
(2) compute with formula (3) the matching scores R_I1, ..., R_IQ of X against the Q background models of M, and compute the score divergence value DIV(X|M) with formula (6);
(3) compute with formula (4) the percentage of feature vectors of the speech to be identified that fall into the critical regions of the Gaussian distributions, i.e. the CAP score;
(4) make the rejection decision:
a) if the likelihood score R_TOP is lower than the likelihood-score threshold, reject the recognition result;
b) if the score CAP(X|M) is lower than the CAP threshold, reject the recognition result;
c) if, after R_TOP and R_I1, ..., R_IQ are sorted together in descending order, the rank of R_TOP is too low (below 2nd place), reject the recognition result;
d) if the absolute value of the difference between R_TOP and the maximum of R_I1, ..., R_IQ is less than the score-difference threshold, reject the recognition result;
e) if the score divergence value DIV(X|M) is greater than the divergence threshold, reject the recognition result;
f) if nothing has been rejected, accept the recognition result.
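The cascade of criteria a)-f) can be sketched as a sequence of early returns (names are mine; r_top, r_bg, cap, and div are the quantities of steps (1)-(3), and th holds the four trained thresholds):

```python
def rejection_decision(r_top, r_bg, cap, div, th):
    """Multi-criterion rejection of step (4): any failed test rejects;
    the result is accepted only if every criterion passes."""
    if r_top < th["score"]:
        return "reject: likelihood score too low"          # criterion a)
    if cap < th["cap"]:
        return "reject: CAP below threshold"               # criterion b)
    ranked = sorted([r_top] + list(r_bg), reverse=True)
    if ranked.index(r_top) > 1:                            # criterion c)
        return "reject: rank below 2nd place"
    if r_bg and abs(r_top - max(r_bg)) < th["diff"]:       # criterion d)
        return "reject: too close to best background"
    if div > th["div"]:                                    # criterion e)
        return "reject: divergence too large"
    return "accept"                                        # criterion f)
```

Because every threshold in th is estimated from the background speakers as described above, this decision needs no hand-set operating point, which is the invention's central claim.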