CN106448685A - System and method for identifying voice prints based on phoneme information - Google Patents

System and method for identifying voice prints based on phoneme information Download PDF

Info

Publication number
CN106448685A
CN106448685A (Application CN201610880776.6A)
Authority
CN
China
Prior art keywords
phoneme
numeric string
information
module
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610880776.6A
Other languages
Chinese (zh)
Other versions
CN106448685B (en)
Inventor
郑榕
张策
王黎明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yuanjian Technologies Co ltd filed Critical Beijing Yuanjian Technologies Co ltd
Priority to CN201610880776.6A priority Critical patent/CN106448685B/en
Publication of CN106448685A publication Critical patent/CN106448685A/en
Application granted granted Critical
Publication of CN106448685B publication Critical patent/CN106448685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a system and a method for voiceprint authentication based on phoneme information. The system comprises a phoneme forced-alignment module based on a Mandarin Chinese speech recognizer, a phoneme-dependent model creation module, and a neural-network classifier module based on a dropout strategy. The method comprises the following steps: defining 16 phoneme classes for Mandarin Chinese digit-string voiceprints and explicitly using the pronunciation-class information of each digit string; using the Viterbi forced-alignment algorithm with the Mandarin Chinese speech recognizer to obtain the phoneme boundaries of the text content of each digit string; building phoneme-dependent models with a text-independent algorithm; and scoring against the phoneme-dependent models to obtain a score vector. The system and method segment phoneme information, model each phoneme, and analyze the discriminative power of each phoneme-dependent model; the proposed neural-network training method based on the dropout strategy solves the problem of missing phonemes in digit strings and improves the performance of digit-string voiceprint authentication systems.

Description

Voiceprint authentication system and method based on phoneme information
Technical field
The present invention relates to the technical field of voiceprint authentication systems, and in particular to a voiceprint authentication system and method based on phoneme information.
Background technology
Biometric recognition is a technology that identifies a person's identity according to the physiological and behavioral characteristics inherent to the human body. It is hard to forget, difficult to counterfeit or steal, always carried with the person, and usable anytime and anywhere. With the rapid development of the Internet, traditional identity authentication techniques increasingly fail to meet the demands of user experience and security. Voiceprint recognition technology is easy to use and, owing to its broad application prospects and its great social and economic benefits, has attracted extensive attention and high regard from all walks of life.
Voiceprint recognition, also known as speaker recognition, is a kind of biometric technology. It distinguishes a speaker's identity through the speech parameters in the speech waveform that reflect the speaker's physiological and behavioral characteristics. It is safe and convenient, and its data are easy to acquire.
In recent years, text-dependent speaker recognition has become a focus of the user-authentication field. Owing to major progress in the field of text-independent speaker recognition, many researchers have tried to apply text-independent speaker recognition algorithms to text-dependent areas such as digit-string voiceprint recognition.
Under digit-string authentication conditions, researchers have compared Joint Factor Analysis (JFA), Gaussian Mixture Model-Nuisance Attribute Projection (GMM-NAP) and Hidden Markov Model-Nuisance Attribute Projection (HMM-NAP). Compared with JFA, the NAP-based algorithms performed better, because training JFA requires a large amount of labeled data and there is a mismatch between the training data of the JFA matrices and the digit-string test data.
In text-independent speaker recognition, both JFA and the total-variability (iVector) algorithm based on Probabilistic Linear Discriminant Analysis (PLDA) rely on large amounts of development data. A growing amount of work is devoted to the problem of transferring limited in-domain development data to out-of-domain application data, for example adaptation to lexical gaps and back-off algorithms.
A digit-string speech corpus of 536 speakers was recorded with Android and iOS (Apple) mobile phones. It is divided into two scenarios: a global condition and a rand-n condition. In the global condition, enrollment and verification use the same digit-string content; in the rand-n condition, each digit string is a random string of length n, which is safer than the global condition in application systems that must resist replay attacks. The three enrollment/verification conditions involved in the present invention are shown in Table 1: a fixed full-digit password, a dynamic 8-digit password, and a dynamic 6-digit password. Each scenario is divided into a development set and an evaluation set. The development set is used to train the Universal Background Model (UBM), the total-variability matrix (iVector T matrix), the Linear Discriminant Analysis (LDA) matrix, and so on. Under the three evaluation conditions, every speaker has three enrollment utterances and one test utterance, and each test utterance is compared against all speaker models.
Table 1: Examples of the digit-password formats
Table 2 compares the equal error rate (EER) of the GMM-NAP and iVector voiceprint authentication systems. The results show that, as the digit-string length increases, the performance of the voiceprint authentication system improves significantly and consistently. However, neither the GMM-NAP nor the iVector system makes use of phoneme (phone) information; both are direct applications of text-independent voiceprint recognition to a text-dependent scenario. In digit-string voiceprint applications, ignoring phoneme information, or failing to use it effectively, limits the practical effectiveness of text-independent recognition algorithms.
Table 2: Equal error rate comparison of GMM-NAP and iVector systems under different test conditions

System     Fixed full-digit password   Dynamic 8-digit password   Dynamic 6-digit password
GMM-NAP    2.09%                       2.64%                      3.76%
iVector    1.87%                       2.40%                      3.32%
Summary of the invention
The object of the present invention is to propose a voiceprint authentication system and method based on phoneme information that can perform phoneme segmentation, build phoneme-dependent (phone-dependent) models and analyze their discriminative power, solve the problem of missing phonemes in digit strings, and improve the performance of digit-string voiceprint authentication systems.
To achieve the above technical purpose, the technical solution of the present invention is realized as follows:
A voiceprint authentication system based on phoneme information comprises a phoneme forced-alignment module based on a Mandarin Chinese speech recognizer, a phoneme-dependent model creation module, and a neural-network classifier module based on a dropout strategy;
the phoneme forced-alignment module based on the Mandarin Chinese speech recognizer is used to segment a digit string into 16 phoneme classes;
the phoneme-dependent model creation module is used to build phoneme-dependent models and to analyze the discriminative power of each phoneme-dependent model for voiceprint authentication, characterizing the distinguishing features of the speaker rather than differences between vocabulary items;
the neural-network classifier module based on the dropout strategy is used to fuse the complementary information of the phoneme-dependent models.
A voiceprint authentication method based on phoneme information comprises the following steps:
S01: define 16 phoneme classes for Mandarin Chinese digit-string voiceprints and explicitly use the pronunciation-class information of each digit string;
S02: based on the Mandarin Chinese speech recognizer, use the Viterbi forced-alignment algorithm to obtain the phoneme boundaries corresponding to the text content of each digit string, completing the phoneme segmentation of the speech content, i.e. the mapping from speech feature vectors to phonemes; the feature-vector subsets belonging to each phoneme are obtained, and each subset is treated as an independent data stream for subsequent processing;
S03: build phoneme-dependent models with a text-independent algorithm; the model-building process reduces the number of parameters of each phoneme-dependent model to avoid over-training;
S04: score against the phoneme-dependent models to obtain a score vector.
Further, in step S04 a back-end fused classifier is trained with the dropout strategy of a neural-network algorithm.
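To make the flow of steps S01 to S04 concrete, the following Python sketch strings them together. It is illustrative only: the forced-alignment segment format and the per-phoneme mean-vector scoring are assumptions standing in for the patent's text-independent phoneme-dependent models (such as GMM-NAP or iVector), and the 16-dimensional score vector it produces would be fed to the dropout-trained neural-network back end described later.

```python
# High-level sketch of S01-S04 (assumed interfaces, not the patent's implementation).
import numpy as np

NUM_PHONEMES = 16          # S01: 16 phoneme classes for Mandarin digit strings
MISSING = 0.0              # placeholder score for phonemes absent from the utterance

def segment(features, alignment):
    """S02: map feature vectors to phonemes using forced-alignment (start, end, phoneme_id) segments."""
    subsets = {}
    for start, end, pid in alignment:
        subsets.setdefault(pid, []).append(features[start:end])
    return {pid: np.concatenate(chunks) for pid, chunks in subsets.items()}

def enroll(features, alignment):
    """S03: build one (toy) phoneme-dependent model per phoneme from enrollment speech."""
    return {pid: frames.mean(axis=0) for pid, frames in segment(features, alignment).items()}

def score_vector(models, features, alignment):
    """S04: score each phoneme stream against its model to obtain a 16-dim score vector."""
    xi = np.full(NUM_PHONEMES, MISSING)
    for pid, frames in segment(features, alignment).items():
        if pid in models:
            m, t = models[pid], frames.mean(axis=0)
            xi[pid] = float(m @ t / (np.linalg.norm(m) * np.linalg.norm(t) + 1e-9))
    return xi   # fed to the dropout-trained neural-network back-end classifier

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    enroll_feats, test_feats = rng.standard_normal((400, 39)), rng.standard_normal((200, 39))
    enroll_ali = [(i * 25, (i + 1) * 25, i) for i in range(16)]   # toy: all 16 phonemes seen
    test_ali = [(0, 100, 2), (100, 200, 7)]                       # toy: only 2 phonemes seen
    models = enroll(enroll_feats, enroll_ali)
    print(score_vector(models, test_feats, test_ali))
```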
Beneficial effects of the present invention:
(1) The present invention uses a typical Mandarin Chinese speech recognizer with the Viterbi forced-alignment algorithm to obtain the phoneme boundaries corresponding to the text content of each digit string and to complete the phoneme segmentation of the speech content; compared with common segmentation based on algorithms such as Dynamic Time Warping (DTW), this segmentation is more accurate;
(2) The present invention defines 16 pronunciation classes for the digit-string pronunciations of Mandarin Chinese, avoiding the over-training problem caused by too few feature vectors in a phoneme class; it builds phoneme-dependent models and analyzes the discriminative power of each phoneme-dependent model for voiceprint authentication; the phoneme-dependent models characterize the distinguishing features of the speaker rather than the differences between vocabulary items;
(3) To further improve the use of the information in the phoneme-dependent models, and considering that in practical applications an authentication utterance contains only part of the phoneme set so that dimensions of the score vector may be missing, a neural-network back-end classifier is trained with the dropout strategy, realizing a fused decision over the phoneme-dependent score vector and clearly improving the system performance of voiceprint authentication.
Description of the drawings
Fig. 1 is the processing flowchart of the back-end classifier for the phoneme-dependent score vector in the present invention;
Fig. 2 shows the equal-error-rate experimental results of the different phoneme-dependent models in the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention fall within the protection scope of the present invention.
The present invention proposes a digit-string voiceprint authentication method that explicitly uses phoneme information and combines it with neural-network classification. For each digit string, the Viterbi forced-alignment algorithm of the Mandarin Chinese speech recognizer completes the phoneme segmentation of the speech content. The number of training parameters of each phoneme-dependent model is reduced to avoid the over-training that the small amount of training speech per phoneme model might cause, and the discriminative power of each phoneme model for voiceprint recognition is analyzed. Because dimensions of the phoneme-dependent score vector may be missing, a back-end fused classifier is trained with the dropout strategy of a neural-network algorithm, which improves the use of phoneme-dependent information and further improves the system performance of digit-string voiceprint authentication.
Table 3 gives the phonemic representation of the ten Mandarin Chinese digit pronunciations. Note that the digit "1" has two pronunciations, "yi" and "yao", so the ten Mandarin digits correspond to 16 phonemes.
Table 3: Mandarin Chinese pronunciation phonemes of the ten digits
In " the whole numerical ciphers of fixation " condition, phoneme content immobilizes." dynamic 8 bit digital password " and " dynamic 6 The phoneme content of numerical ciphers " is also known, because the random algorithm that digital text is typically based on background system is pushed or base Generated according to special algorithm in OTP dynamic password (One-time Password).
Based on the Mandarin Chinese speech recognition system, the Viterbi forced-alignment algorithm is used to obtain the phoneme boundaries corresponding to the text content, completing the phoneme segmentation of the speech content, i.e. the mapping from speech feature vectors to phonemes.
Therefore, given the acoustic feature vector sequence χ = x_1, ..., x_T of a digit-string utterance, it can be segmented into discrete subsets χ_1, ..., χ_16, where x ∈ χ_i denotes a feature vector belonging to the i-th phoneme. Each subset is treated as an independent data stream for subsequent processing. In the voiceprint enrollment stage, the 16 phoneme-dependent models of speaker s (one model per phoneme) are obtained by a text-independent training algorithm. Note that the enrollment speech must cover all ten digits; in the present invention, three digit strings are used for enrollment, ensuring that every digit occurs at least once in each speaker's enrollment speech.
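As a small illustration (not taken from the patent) of the enrollment constraint just stated, the following check verifies that a set of enrollment digit strings covers every digit at least once, so that every phoneme-dependent model can be trained.

```python
# Illustrative check of the enrollment coverage requirement described above.
def covers_all_digits(enrollment_strings: list[str]) -> bool:
    """True if the enrollment digit strings together contain every digit 0-9 at least once."""
    seen = set("".join(enrollment_strings))
    return set("0123456789") <= seen

if __name__ == "__main__":
    print(covers_all_digits(["0123456789", "3141592653", "2718281828"]))  # True
    print(covers_all_digits(["123123", "456456", "789789"]))              # False: '0' missing
```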
In the voiceprint authentication stage, under the "fixed full-digit password" condition a 16-dimensional score vector ξ = (ξ_1, ..., ξ_16) is obtained, and a decision can be made by averaging the score vector ξ or by training a back-end classifier with a method such as logistic regression. Under rand-n conditions such as the "dynamic 8-digit password" and "dynamic 6-digit password", however, dimensions of the score vector ξ may be missing, because the test utterance contains only part of the phoneme set. To solve this problem, the dropout strategy from neural-network algorithms is adopted, which is an effective way to improve generalization.
The dropout training algorithm of the neural network is standard stochastic gradient descent, except that during the forward computation some input units and hidden units are randomly ignored with a certain probability γ. Only the active units participate in back-propagation and the gradient computation. Because dropout is not used at recognition time, the output of each layer is rescaled during training:

y^l = δ( W^l (bm ∗ y^(l-1)) + b^l )

where δ(·), W^l and b^l are respectively the activation function, the weights of layer l and the bias of layer l; bm is a binary mask indicating which dimensions are dropped, and ∗ denotes element-wise multiplication. The retained activations are rescaled by 1/(1 − γ) so that no adjustment is needed at recognition time.
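A minimal numpy sketch of one such dropout forward step is given below. It assumes a ReLU activation for δ(·) and uses inverted-dropout rescaling by 1/(1 − γ); it illustrates the mechanism only and is not the patent's implementation.

```python
# Illustrative dropout forward step: drop units with probability gamma via a binary
# mask and rescale the survivors, so nothing changes when dropout is off at test time.
import numpy as np

def dropout_forward(y_prev: np.ndarray, W: np.ndarray, b: np.ndarray,
                    gamma: float, train: bool = True) -> np.ndarray:
    if train:
        bm = (np.random.rand(*y_prev.shape) >= gamma).astype(y_prev.dtype)  # binary mask
        y_prev = bm * y_prev / (1.0 - gamma)  # inverted-dropout rescaling
    z = y_prev @ W + b
    return np.maximum(z, 0.0)  # delta(.) chosen here as ReLU for illustration

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)               # 16-dim phoneme score vector as input
    W, b = rng.standard_normal((16, 32)), np.zeros(32)
    print(dropout_forward(x, W, b, gamma=0.2).shape)  # (32,)
```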
This process can be viewed as an effective model-averaging method, i.e. an averaged representation of a large number of different networks that share weights but have dropped-out units. As shown in Fig. 1, a neural-network classifier containing one hidden layer is trained. Its input is the score vector, and its output contains two units representing the target-verification class and the impostor-verification class respectively. To handle the missing score-vector dimensions under rand-n conditions such as the "dynamic 8-digit password" and "dynamic 6-digit password", the dropout strategy is applied to the input layer with probability γ during network training. In the verification stage, the log-likelihood ratio computed as follows is used as the system output:

LLR(ξ) = log p(ξ | target verification class) − log p(ξ | impostor verification class)
where p(ξ | target verification class) and p(ξ | impostor verification class) are the likelihoods of the score vector ξ. By Bayes' formula, the likelihoods can be converted into posterior form:
p(ξ | target verification class) = p(target verification class | ξ) p(ξ) / p(target verification class)
p(ξ | impostor verification class) = p(impostor verification class | ξ) p(ξ) / p(impostor verification class)
where p(target verification class | ξ) and p(impostor verification class | ξ) are the posteriors of the score vector ξ obtained by the forward computation of the network, and p(target verification class) and p(impostor verification class) are the priors of the target-verification class and the impostor-verification class estimated from the training set. p(ξ) is independent of either model and can be ignored in the LLR computation.
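The sketch below ties these pieces together: a one-hidden-layer classifier with dropout on the input layer and an LLR computed from the network posteriors and the class priors, with p(ξ) cancelling in the ratio. The hidden-layer size, the ReLU activation and the random initialization are assumptions for illustration; only the overall structure follows the description above.

```python
# Illustrative one-hidden-layer classifier with input-layer dropout and LLR scoring.
import numpy as np

class ScoreVectorClassifier:
    def __init__(self, in_dim=16, hidden=64, gamma=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.gamma = gamma
        self.W1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, 2)) * 0.1   # outputs: [target, impostor]
        self.b2 = np.zeros(2)

    def posteriors(self, xi: np.ndarray, train: bool = False) -> np.ndarray:
        x = xi
        if train:  # dropout on the input layer only, as in the text
            mask = (np.random.rand(*x.shape) >= self.gamma).astype(x.dtype)
            x = mask * x / (1.0 - self.gamma)
        h = np.maximum(x @ self.W1 + self.b1, 0.0)
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())
        return e / e.sum()                                  # p(class | xi)

    def llr(self, xi: np.ndarray, prior_target: float, prior_impostor: float) -> float:
        post = self.posteriors(xi, train=False)
        # Bayes: p(xi | class) is proportional to p(class | xi) / p(class); p(xi) cancels.
        return float(np.log(post[0] / prior_target) - np.log(post[1] / prior_impostor))

if __name__ == "__main__":
    clf = ScoreVectorClassifier()
    xi = np.random.randn(16)   # phoneme-dependent score vector (missing dims set to 0)
    print(clf.llr(xi, prior_target=0.1, prior_impostor=0.9))
```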
The discriminative power of each phoneme model for voiceprint recognition is analyzed first. Considering that the amount of training speech for each phoneme model is small, the number of training parameters of each phoneme-dependent model is reduced in order to avoid over-training. Fig. 2 gives the equal-error-rate comparison of the phoneme-dependent models.
As can be seen from Fig. 2, first, across all phoneme-dependent models iVector is slightly better than the GMM-NAP model. Second, the EER of the worst-performing consonant "w" is about five times that of the best-performing vowel "an". This experimental result is instructive for practical applications: an online system can avoid pushing digits with poor performance, such as "5 [wu]".
A back-end classifier is trained with the dropout neural network to produce a fused output from the phoneme-dependent score vector. Table 4 gives the equal-error-rate comparison of the phoneme-dependent models with different back-end classifiers. For convenience of comparison, the authentication performance obtained by averaging the phoneme-dependent scores of the GMM-NAP and iVector systems is also given; the averaged score is the arithmetic mean of the available components of the score vector ξ.
Table 4: Equal-error-rate comparison of the phoneme-dependent models with different back-end classifiers
As can be seen from Table 4, the algorithm of the present invention, which explicitly uses phoneme information and fuses it with a neural-network back end, can effectively improve the system performance of digit-string voiceprint authentication. Compared with score averaging, the neural-network back-end classifier achieves a lower equal error rate and better performance. Compared with the GMM-NAP and iVector results in Table 2, under the three different enrollment/verification conditions the algorithm with phoneme-dependent models and a neural-network back-end classifier achieves a relative EER reduction of about 20%.
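For comparison with the neural-network back end, the score-averaging baseline referenced in Table 4 can be sketched as follows; skipping absent dimensions when phonemes are missing is an assumption made for illustration and not a detail taken from the patent.

```python
# Illustrative score-averaging baseline: mean of the phoneme-dependent scores that
# are actually present in the test utterance.
import numpy as np

def average_fusion(xi: np.ndarray, present_mask: np.ndarray) -> float:
    """xi: 16-dim phoneme-dependent scores; present_mask: 1 where the phoneme occurred."""
    present = present_mask.astype(bool)
    return float(xi[present].mean()) if present.any() else 0.0

if __name__ == "__main__":
    xi = np.random.randn(16)
    mask = np.zeros(16)
    mask[[0, 3, 5, 9]] = 1   # only four phonemes occur in a short dynamic password
    print(average_fusion(xi, mask))
```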
The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (3)

1. A voiceprint authentication system based on phoneme information, characterized in that it comprises a phoneme forced-alignment module based on a Mandarin Chinese speech recognizer, a phoneme-dependent model creation module, and a neural-network classifier module based on a dropout strategy;
the phoneme forced-alignment module based on the Mandarin Chinese speech recognizer is used to segment a digit string into 16 phoneme classes;
the phoneme-dependent model creation module is used to build phoneme-dependent models and to analyze the discriminative power of each phoneme-dependent model for voiceprint authentication;
the neural-network classifier module based on the dropout strategy is used to fuse the complementary information of the phoneme-dependent models.
2. A voiceprint authentication method based on phoneme information, characterized in that it comprises the following steps:
S01: defining 16 phoneme classes for Mandarin Chinese digit-string voiceprints and explicitly using the pronunciation-class information of each digit string;
S02: based on the Mandarin Chinese speech recognizer, using the Viterbi forced-alignment algorithm to obtain the phoneme boundaries corresponding to the text content of each digit string, completing the phoneme segmentation of the speech content and obtaining the feature-vector subsets belonging to each phoneme;
S03: building phoneme-dependent models with a text-independent algorithm;
S04: scoring against the phoneme-dependent models to obtain a score vector.
3. The voiceprint authentication method based on phoneme information according to claim 2, characterized in that in step S04 a back-end fused classifier is trained with the dropout strategy of a neural-network algorithm.
CN201610880776.6A 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information Active CN106448685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610880776.6A CN106448685B (en) 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610880776.6A CN106448685B (en) 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information

Publications (2)

Publication Number Publication Date
CN106448685A true CN106448685A (en) 2017-02-22
CN106448685B CN106448685B (en) 2019-11-22

Family

ID=58172115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610880776.6A Active CN106448685B (en) 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information

Country Status (1)

Country Link
CN (1) CN106448685B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198574A (en) * 2017-12-29 2018-06-22 科大讯飞股份有限公司 Change of voice detection method and device
CN108648760A (en) * 2018-04-17 2018-10-12 四川长虹电器股份有限公司 Real-time sound-groove identification System and method for
CN109065023A (en) * 2018-08-23 2018-12-21 广州势必可赢网络科技有限公司 A kind of voice identification method, device, equipment and computer readable storage medium
CN110111798A (en) * 2019-04-29 2019-08-09 平安科技(深圳)有限公司 A kind of method and terminal identifying speaker
CN110689895A (en) * 2019-09-06 2020-01-14 北京捷通华声科技股份有限公司 Voice verification method and device, electronic equipment and readable storage medium
CN110875044A (en) * 2018-08-30 2020-03-10 中国科学院声学研究所 Speaker identification method based on word correlation score calculation
CN111243603A (en) * 2020-01-09 2020-06-05 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111341320A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Phrase voice voiceprint recognition method and device
CN111785284A (en) * 2020-08-19 2020-10-16 科大讯飞股份有限公司 Method, device and equipment for recognizing text-independent voiceprint based on phoneme assistance
CN114299921A (en) * 2021-12-07 2022-04-08 浙江大学 Voiceprint security scoring method and system for voice command
CN115831120A (en) * 2023-02-03 2023-03-21 北京探境科技有限公司 Corpus data acquisition method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033041A1 (en) * 2004-07-12 2007-02-08 Norton Jeffrey W Method of identifying a person based upon voice analysis
CN101467204A (en) * 2005-05-27 2009-06-24 普提克斯科技股份有限公司 Method and system for bio-metric voice print authentication
CN204465555U (en) * 2015-04-14 2015-07-08 时代亿宝(北京)科技有限公司 Based on the voiceprint authentication apparatus of time type dynamic password
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033041A1 (en) * 2004-07-12 2007-02-08 Norton Jeffrey W Method of identifying a person based upon voice analysis
CN101467204A (en) * 2005-05-27 2009-06-24 普提克斯科技股份有限公司 Method and system for bio-metric voice print authentication
CN204465555U (en) * 2015-04-14 2015-07-08 时代亿宝(北京)科技有限公司 Based on the voiceprint authentication apparatus of time type dynamic password
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张涛涛 (Zhang Taotao): "Research on Voice Voiceprint Password Verification Technology" (语音声纹密码验证技术研究), China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198574A (en) * 2017-12-29 2018-06-22 科大讯飞股份有限公司 Change of voice detection method and device
CN108198574B (en) * 2017-12-29 2020-12-08 科大讯飞股份有限公司 Sound change detection method and device
CN108648760A (en) * 2018-04-17 2018-10-12 四川长虹电器股份有限公司 Real-time sound-groove identification System and method for
CN109065023A (en) * 2018-08-23 2018-12-21 广州势必可赢网络科技有限公司 A kind of voice identification method, device, equipment and computer readable storage medium
CN110875044A (en) * 2018-08-30 2020-03-10 中国科学院声学研究所 Speaker identification method based on word correlation score calculation
CN110875044B (en) * 2018-08-30 2022-05-03 中国科学院声学研究所 Speaker identification method based on word correlation score calculation
CN110111798A (en) * 2019-04-29 2019-08-09 平安科技(深圳)有限公司 A kind of method and terminal identifying speaker
CN110111798B (en) * 2019-04-29 2023-05-05 平安科技(深圳)有限公司 Method, terminal and computer readable storage medium for identifying speaker
CN110689895A (en) * 2019-09-06 2020-01-14 北京捷通华声科技股份有限公司 Voice verification method and device, electronic equipment and readable storage medium
CN111243603A (en) * 2020-01-09 2020-06-05 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111341320B (en) * 2020-02-28 2023-04-14 中国工商银行股份有限公司 Phrase voice voiceprint recognition method and device
CN111341320A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Phrase voice voiceprint recognition method and device
CN111785284A (en) * 2020-08-19 2020-10-16 科大讯飞股份有限公司 Method, device and equipment for recognizing text-independent voiceprint based on phoneme assistance
CN111785284B (en) * 2020-08-19 2024-04-30 科大讯飞股份有限公司 Text-independent voiceprint recognition method, device and equipment based on phoneme assistance
CN114299921B (en) * 2021-12-07 2022-11-18 浙江大学 Voiceprint security scoring method and system for voice command
CN114299921A (en) * 2021-12-07 2022-04-08 浙江大学 Voiceprint security scoring method and system for voice command
CN115831120A (en) * 2023-02-03 2023-03-21 北京探境科技有限公司 Corpus data acquisition method and device, electronic equipment and readable storage medium
CN115831120B (en) * 2023-02-03 2023-06-16 北京探境科技有限公司 Corpus data acquisition method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN106448685B (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN106448685A (en) System and method for identifying voice prints based on phoneme information
KR101995547B1 (en) Neural Networks for Speaker Verification
CN104575490B (en) Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
US10013972B2 (en) System and method for identifying speakers
ES2605779T3 (en) Speaker Recognition
Gomez-Alanis et al. On joint optimization of automatic speaker verification and anti-spoofing in the embedding space
TWI527023B (en) A voiceprint recognition method and apparatus
Dileep et al. GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN104765996B (en) Voiceprint password authentication method and system
KR20060070603A (en) Two stage utterance verification method and device of speech recognition system
CN104240706B (en) It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN104462912B (en) Improved biometric password security
Lopez-Otero et al. Analysis of gender and identity issues in depression detection on de-identified speech
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
Wang et al. Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification
CN105280181A (en) Training method for language recognition model and language recognition method
Safavi et al. Fraud detection in voice-based identity authentication applications and services
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
Folorunso et al. A review of voice-base person identification: state-of-the-art
Li et al. Cost-sensitive learning for emotion robust speaker recognition
Chen et al. Speech emotion classification using acoustic features
US6499012B1 (en) Method and apparatus for hierarchical training of speech models for use in speaker verification
Wang et al. Capture interspeaker information with a neural network for speaker identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: East Zone 9A, 9th Floor, Building 1, No. 158 West Fourth Ring North Road, Haidian District, Beijing, 100142

Patentee after: Beijing Yuan Jian Polytron Technologies Inc.

Address before: East Zone 9A, 9th Floor, Building 1, No. 158 West Fourth Ring North Road, Haidian District, Beijing, 100142

Patentee before: Beijing Yuanjian Technologies Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231010

Address after: No. 016, Xiaocuigezhuang Village, Gaolou Town, Sanhe City, Langfang City, Hebei Province, 065200

Patentee after: Liu Xuefeng

Address before: East Zone 9A, 9th Floor, Building 1, No. 158 West Fourth Ring North Road, Haidian District, Beijing, 100142

Patentee before: Beijing Yuan Jian Polytron Technologies Inc.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240218

Address after: Room 320, 3rd Floor, Building A, No. 119 West Fourth Ring North Road, Haidian District, Beijing, 100000

Patentee after: Beijing Yuanjian Information Technology Co.,Ltd.

Country or region after: China

Address before: No. 016, Xiaocuigezhuang Village, Gaolou Town, Sanhe City, Langfang City, Hebei Province, 065200

Patentee before: Liu Xuefeng

Country or region before: China