CN109378002A

CN109378002A - Method, apparatus, computer equipment and the storage medium of voice print verification

Info

Publication number: CN109378002A
Application number: CN201811184693.9A
Authority: CN
Inventors: 杨翘楚; 王健宗; 肖京
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-10-11
Filing date: 2018-10-11
Publication date: 2019-02-22
Anticipated expiration: 2038-10-11
Also published as: WO2020073518A1; CN109378002B

Abstract

This application discloses the method, apparatus of voice print verification, computer equipment and storage mediums, wherein the method for voice print verification, comprising: the voice signal to voice print verification is input in VAD model, distinguishes the speech frame and noise frame in the voice signal；The noise frame is removed, the voice data of the purification of each speech frame composition is obtained；Extract corresponding first vocal print feature of voice data of the purification；Judge whether first vocal print feature meets preset condition with the similarity for prestoring vocal print feature；If satisfied, then determine that first vocal print feature is identical as the vocal print feature that prestores, it is otherwise not identical.The application removes the voice data that noise data is purified by the noise data in recognition of speech signals, then carries out Application on Voiceprint Recognition according to purified voice data, improves the accuracy of voice print verification.

Description

Method, apparatus, computer equipment and the storage medium of voice print verification

Technical field

This application involves voice print verification field is arrived, especially relate to the method, apparatus of voice print verification, computer equipment and Storage medium.

Background technique

Currently, the scope of business of many large size financing corporations is related to multiple business such as insurance, bank, investment, and it is every A business usually requires same client and links up, and requires to carry out instead cheating identification, therefore, tests the identity of client Card and anti-fraud identification also just become the important component for guaranteeing service security.In client identity verifying link, vocal print is tested Demonstrate,prove the real-time having due to it and easily just property and used by many companies.In practical applications, by speaker in identity registration Or such environmental effects locating for authentication link, collected voice data are made an uproar often with the non-background from speaker Sound, this factor become one of the principal element for influencing voice print verification success rate.

Summary of the invention

The main purpose of the application is to provide a kind of method of voice print verification, it is intended to solve the noise in existing voice data The technical issues of adverse effect is generated to voice print verification effect.

A kind of method that the application proposes voice print verification, comprising:

Voice signal to voice print verification is input in VAD model, the speech frame in the voice signal is distinguished and is made an uproar Sound frame；

The noise frame is removed, the voice data of the purification of each speech frame composition is obtained；

Extract corresponding first vocal print feature of voice data of the purification；

Judge whether first vocal print feature meets preset condition with the similarity for prestoring vocal print feature；

If satisfied, then determine that first vocal print feature is identical as the vocal print feature that prestores, it is otherwise not identical.

It preferably, include the GMM-NOISE and GMM- of Fourier transformation, Gaussian Mixture distribution in the VAD model SPEECH, it is described that the voice signal is input in VAD model, distinguish the step of the speech frame and noise frame in voice signal Suddenly, comprising:

The voice signal is input in the Fourier transformation in VAD model, by the voice signal from time-domain signal Form is changed into frequency-region signal form；

Each frame data of the voice signal of frequency-region signal form are separately input to the GMM-NOISE and GMM- VAD judgement is carried out in SPEECH, to distinguish the speech frame and noise frame in voice signal.

Preferably, each frame data of the voice signal by frequency-region signal form are separately input to the GMM- VAD judgement is carried out in NOISE and GMM-SPEECH, the step of to distinguish the speech frame and noise frame in voice signal, comprising:

Each frame data of the voice signal of frequency-region signal form are separately input in GMM-NOISE and GMM-SPEECH, Respectively obtain the noise frame probability of each frame dataWith speech frame probability

According toCalculating office Portion's log-likelihood ratio；

Judge whether the partial log likelihood ratio is higher than local gate limit value；

If so, the frame data for determining that the partial log likelihood ratio is higher than local gate limit value are speech frame.

Preferably, it is described judge the step of whether log-likelihood ratio is higher than local gate limit value after, comprising:

If partial log likelihood ratio is not higher than local gate limit value, basisCalculate global log-likelihood ratio；

Judge whether the global log-likelihood ratio is higher than global threshold；

If global log-likelihood ratio is higher than global threshold, determine that the global log-likelihood ratio is higher than global threshold Frame data be speech frame.

Preferably, the step of voice data corresponding first vocal print feature for extracting the purification, comprising:

Extract the corresponding MFCC type vocal print feature of each speech frame in the voice data of the purification；

The corresponding vocal print feature vector of each speech frame is constructed according to each MFCC type vocal print feature；

Each vocal print feature vector is each mapped to the vocal print discriminant vectors I-vector of low dimensional, it is described to obtain Corresponding first vocal print feature of each speech frame in the voice data of purification.

Preferably, described to judge whether first vocal print feature meets preset condition with the similarity for prestoring vocal print feature The step of, comprising:

It is obtained in the vocal print feature data of the multiple people prestored respectively and corresponding prestores vocal print feature, wherein is more Vocal print feature is prestored including target person in personal vocal print feature data；

Calculate separately each similarity value prestored between vocal print feature and first vocal print feature；

Each similarity value is ranked up according to sequence from big to small；

In the similarity value of the preceding preset quantity of judgement sequence, if prestore vocal print feature pair including the target person The similarity value answered；

If so, determining that first vocal print feature meets preset condition with the similarity for prestoring vocal print feature, otherwise not Meet preset condition.

Preferably, described to calculate separately each similarity value prestored between vocal print feature and first vocal print feature The step of, comprising:

Pass through COS distance formula respectivelyIt calculates and each described prestores vocal print feature and described the COS distance value between one vocal print feature, wherein x representative respectively prestores vocal print discriminant vectors, and y represents the sound of the first vocal print feature Line discriminant vectors；

The COS distance value is converted into the similarity value, wherein the smallest COS distance value is corresponding maximum Similarity value.

Present invention also provides a kind of devices of voice print verification, comprising:

Discriminating module is distinguished in the voice signal for the voice signal to voice print verification to be input in VAD model Speech frame and noise frame；

Module is removed, for removing the noise frame, obtains the voice data of the purification of each speech frame composition；

Extraction module, corresponding first vocal print feature of voice data for extracting the purification；

Judgment module, for judging whether first vocal print feature meets default item with the similarity for prestoring vocal print feature Part；

Determination module, if determining that first vocal print feature prestores vocal print feature with described for meeting preset condition It is identical, it is otherwise not identical.

Present invention also provides a kind of computer equipment, including memory and processor, the memory is stored with calculating The step of machine program, the processor realizes the above method when executing the computer program.

Present invention also provides a kind of computer readable storage mediums, are stored thereon with computer program, which is characterized in that The computer program realizes the step of above-mentioned method when being executed by processor.

The application removes the voice data that noise data is purified by the noise data in recognition of speech signals, Then Application on Voiceprint Recognition is carried out according to purified voice data, improves the accuracy of voice print verification.The application passes through GMM-VAD mould Type, in conjunction with local decisions and global decision, realization accurately distinguishes noise data and voice data, to improve purification voice letter Number degree, further increase the accuracy of voice print verification.The application is based on GMM-UBM and realizes each vocal print feature vector It is each mapped to the vocal print discriminant vectors I-vector of low dimensional, reduces the calculating cost in vocal print feature extraction process, is reduced The use cost of voice print verification.The application, by being compared analysis with the pre-stored data of more people, drops during voice print verification Low voice print verification etc. error rates, reduce the error of the model error bring voice print verification precision of voice print verification.

Detailed description of the invention

The method flow schematic diagram of the voice print verification of one embodiment of Fig. 1 the application；

The apparatus structure schematic diagram of the voice print verification of one embodiment of Fig. 2 the application；

The computer equipment schematic diagram of internal structure of one embodiment of Fig. 3 the application.

Specific embodiment

It should be appreciated that specific embodiment described herein is only used to explain the application, it is not used to limit the application.

Referring to Fig.1, the method for a kind of voice print verification of one embodiment of the application, comprising:

S1: the voice signal to voice print verification is input in VAD model, distinguish speech frame in the voice signal and Noise frame.

The VAD model of the present embodiment, also known as speech terminals detection device, for detecting whether that there are voice in noise circumstance Voice data.VAD model by giving a mark to each frame voice signal of input, i.e., the frame voice signal be speech frame or The probability of noise frame, when speech frame probability value be greater than preset decision threshold, then be determined as speech frame, be otherwise noise Frame.VAD model distinguishes speech frame and noise frame according to above-mentioned court verdict, to remove the noise in voice signal Frame.The decision threshold of the present embodiment uses the decision threshold defaulted in Webrtc source code, which is Webrtc skill It is got when art is developed by analyzing mass data, to improve the effect and accuracy distinguished, and reduces the mould of VAD model simultaneously Type training amount.

S2: removing the noise frame, obtains the voice data of the purification of each speech frame composition.

The present embodiment is according to above-mentioned differentiation as a result, by cutting off labeled as the data of noise frame, by remaining each institute's predicate Sound frame forms the voice data of the purification of each speech frame composition according to the successively continuous arrangement of former time of allocation sequence. The application other embodiments can also by above-mentioned differentiation as a result, selection markers be speech frame data extract preservation, will mention Each speech frame deposited go bail for according to the successively continuous arrangement of former time of allocation sequence, forms the described of each speech frame composition The voice data of purification.The present embodiment is by coming from speaker for the non-of the environment locating for identity registration or authentication link Background noise data get rid of, influence of the noise data to voice print verification effect in voice signal is reduced, to improve vocal print It is proved to be successful rate.

S3: corresponding first vocal print feature of voice data of the purification is extracted.

The present embodiment is by corresponding first vocal print feature of voice data that only analysis purifies, to reduce in voice print verification Calculation amount, and the validity, specific aim and timeliness of voice print verification are improved simultaneously.

S4: judge whether first vocal print feature meets preset condition with the similarity for prestoring vocal print feature.

The preset condition of the present embodiment includes specified preset threshold range or specified sequence etc., can be according to specific Application scenarios carry out customized setting, broadly to meet personalized use demand.

S5: if satisfied, then determine that first vocal print feature is identical as the vocal print feature that prestores, it is otherwise not identical.

The present embodiment will determine that first vocal print feature prestores that vocal print feature is identical with described, then feedback validation passes through As a result client is arrived, otherwise, the result of feedback validation failure is to client, so that client carries out further according to feedback result Application operating.Citing ground controls intelligent door opening etc. after being verified.With illustrating again, it is controlled after authentication failed predetermined number of times Security system carries out screen locking, further destroys e-banking system to prevent offender.

Further, in the VAD model of the present embodiment include Fourier transformation, Gaussian Mixture distribution GMM-NOISE and GMM-SPEECH, step S1, comprising:

S100: the voice signal is input in the Fourier transformation in VAD model, by the voice signal from time domain Signal form is changed into frequency-region signal form.

Time-domain signal form is converted into frequency domain letter by the Fourier transformation in VAD model by the present embodiment correspondingly Number form facilitates to carry out analyzing the attribute of each frame voice signal and distinguishes speech frame and noise frame.

S101: by each frame data of the voice signal of frequency-region signal form be separately input to the GMM-NOISE and VAD judgement is carried out in GMM-SPEECH, to distinguish the speech frame and noise frame in voice signal.

The present embodiment is preferably based on the VAD model of mixed Gaussian GMM, to the language of each frame frequency domain signal form of input Sound signal carries out Energy extraction in 6 frequency ranges, as the feature vector of the frame voice signal, and to noise and voice at 6 Gaussian Mixture distribution GMM modeling is carried out in frequency range respectively, is had in each frequency range containing there are two the noise GMM- of Gaussian component NOISE and containing there are two the voice GMM-SPEECH of Gaussian component.Above-mentioned 6 frequency ranges are based on noise and language according to Webrtc technology The frequency spectrum difference of sound is configured, to improve accuracy of analysis and the matching with Webrtc technology.Other realities of the application The analysis frequency range for applying example must be not necessarily 6, can be set according to actual needs.And the present embodiment is exchanged based on China Electric standard is 220V, 50Hz, and the interference of power supply 50Hz can be mixed into the microphone of acquisition voice signal, collected interference signal And physical shock can affect, the voice signal of the present embodiment preferred acquisition 80Hz or more, to reduce the dry of alternating current It disturbs, and the attainable highest frequency of voice is 4kHz, so frequency spectrum trough of the present embodiment preferably within the scope of 80Hz to 4kHz Locate division limits.The VAD judgement of the present embodiment includes local decisions (Local Decision) and global decision (Global Decisioin)。

Further, the step S101 of the present embodiment, comprising:

S1010: by each frame data of the voice signal of frequency-region signal form, GMM-NOISE and GMM- are separately input to In SPEECH, the noise frame probability of each frame data is respectively obtainedWith speech frame probability

The present embodiment by be by preanalysis speech frame or noise frame voice signal each frame data, be separately input to In GMM-NOISE and GMM-SPEECH, the noise frame probability that GMM-NOISE and GMM-SPEECH analyzes each frame data respectively is obtained Value and speech frame probability value, so as to by comparing noise frame probability value and speech frame probability value size, so that determination is noise Frame or speech frame.

S1011: according to Calculate partial log likelihood ratio.

The present embodiment is preferably based on the VAD model of mixed Gaussian GMM, to the language of each frame frequency domain signal form of input Sound signal carries out Energy extraction in 6 frequency ranges, as the feature vector of the frame voice signal, so n value in the present embodiment It is 6, when judgement each frame, can all carries out 6 local decisions, i.e., carry out local decisions respectively in 6 frequency ranges, As long as once thinking that the frame is speech frame, that is, retain this frame.

S1012: judge whether the partial log likelihood ratio is higher than local gate limit value.

The present embodiment realizes the differentiation to speech frame and noise frame by local decisions, and the local decisions of the present embodiment exist It is done in each frequency range once, 6 times altogether.Likelihood ratio is a kind of index of representation faithfulness, belongs to while reflecting sensitivity and spy The composite index of different degree, improves Probability estimate accuracy, and the present embodiment is ensuring speech frame probability value greater than noise frame probability value In the case where, further whether it is higher than local gate limit value by comparing partial log likelihood ratio, to ensure to be determined as that the voice is believed Number be speech frame accuracy.

S1013: if so, the frame data for determining that partial log likelihood ratio is higher than local gate limit value are speech frame.

The parameter of the GMM of the present embodiment has adaptive updates ability, is judged as speech frame in each frame voice signal Or after noise frame, its parameter for corresponding to model can be updated according to the characteristic value of the frame.For example, if the frame is judged as Speech frame, then the desired value, standard deviation of GMM-SPEECH and Gaussian component weighted value are just carried out according to the characteristic value of the frame primary It updates, after more and more speech frames input GMM-SPEECH, GMM-SPEECH can increasingly adapt to this logical voice signal Speaker vocal print feature, the analysis conclusion provided can be more accurate.

Further, after the step S1012 of another embodiment of the application, comprising:

S1014: if partial log likelihood ratio is not higher than local gate limit value, basisCalculate global log-likelihood ratio.

The present embodiment first carries out local decisions, then carries out global decision, and global decision is the base based on local decisions result The calculating of each frequency range weighted sum is carried out on plinth, to improve the accuracy for distinguishing speech frame and noise frame.

S1015: judge whether the global log-likelihood ratio is higher than global threshold.

In the global decision of the present embodiment by global log-likelihood ratio compared with global threshold, to further increase screening The accuracy of speech frame.

S1016: if global log-likelihood ratio is higher than global threshold, determine that global log-likelihood ratio is higher than global threshold The frame data of value are speech frame.

The present embodiment can be first according to local decisions result with the presence of voice, then without global decision, to improve vocal print The efficiency of verifying, and as far as possible can recognize all speech frames, in order to avoid voice distortion.The application other embodiments can also be Local decisions result is with the presence of voice, then carries out global decision, further to verify and confirm the presence of voice, improves and distinguishes language The accuracy of sound frame and noise frame.

Further, the step S3 of the present embodiment, comprising:

S30: the corresponding MFCC type vocal print feature of each speech frame in the voice data of the purification is extracted.

The present embodiment extracts MFCC (Mel Frequency Cepstrum Coefficient, mel-frequency cepstrum coefficient) The process of type vocal print feature is as follows: first sampling and quantifies, by the continuous analog voice signal of the voice data of purification with certain Sampling period sampling, be converted into discrete signal, and discrete signal is quantified as by digital signal according to certain coding rule；So Preemphasis afterwards, due to the physiological property of human body, the radio-frequency component of voice signal is often constrained, and the effect of preemphasis is that compensation is high Frequency ingredient；Then sub-frame processing, due to " the instantaneous stationarity " of voice signal, when carrying out spectrum analysis to one section of voice signal It carries out sub-frame processing (generally 10 to 30 millisecond of one frame), feature extraction is then carried out as unit of frame；Then windowing process is made With being to reduce frame starting and frame end to the discontinuity problem of induction signal, windowing process is carried out using Hamming window；Then to frame Signal carries out DFT, and signal is transformed into frequency domain from time domain, following formula is then recycled to be mapped to signal from linear spectral domain Meier spectrum domain:Frame signal after conversion is input to one group of Meier triangle filter Wave device group calculates the signal logarithmic energy of the filter output of each frequency range, obtains a logarithmic energy sequence；Previous step is obtained To logarithmic energy sequence do discrete cosine transform (DCT, Discrete Cosine Transform) the frame voice can be obtained The MFCC type vocal print feature of signal.

S31: the corresponding vocal print feature vector of each speech frame is constructed according to each MFCC type vocal print feature.

MFCC type vocal print feature has nonlinear characteristic, issues the analysis result in each frequency range closer to human body true The feature of real voice extracts vocal print feature more accurate, improves the effect of voice print verification.

S32: each vocal print feature vector is each mapped to the vocal print discriminant vectors I-vector of low dimensional, to obtain Corresponding first vocal print feature of each speech frame in the voice data of the purification.

The present embodiment is based on GMM-UBM (Gaussian Mixture Model-Universal Background Model, gauss hybrid models-background model) realize that the vocal print that each vocal print feature vector is each mapped to low dimensional reflects Other vector I-vector reduces the calculating cost in vocal print feature extraction process, reduces the use cost of voice print verification.This implementation The training process of the GMM-UBM of example is as follows: B1: obtaining the voice data sample of preset quantity (for example, 100,000), each voice Data sample corresponds to a vocal print discriminant vectors, and each voice data sample can be acquired from different people in different environments Voice is formed, and such voice data sample, which is used to training, can characterize the universal background model (GMM- of general characteristics of speech sounds UBM)；B2, each voice data sample is handled respectively to extract the corresponding preset kind of each voice data sample Vocal print feature, and each voice data sample is constructed based on the corresponding preset kind vocal print feature of each voice data sample and is corresponded to Vocal print feature vector；B3, the training set that all preset kind vocal print feature vectors constructed are divided into the first percentage and The verifying collection of second percentage, first percentage and the second percentage are less than or equal to 100%；B4, using in training set Vocal print feature vector second model is trained, and after training is completed using verifying collection to trained described the The accuracy rate of two models is verified；If B5, accuracy rate are greater than default accuracy rate (for example, 98.5%), model training terminates, Otherwise, increase voice data sample quantity, and based on the voice data sample after increase re-execute above-mentioned steps B2, B3, B4、B5。

The vocal print discriminant vectors of the present embodiment are expressed using I-vector, and i-vector is a vector, relative to Gauss For the dimension in space, i-vector dimension is lower, calculates cost convenient for reducing, and extracts the process of the i-vector of low dimensional It is that the vector w of low dimensional and a transition matrix T-phase are multiplied and are mapped to the higher Gauss sky of dimension by following calculation formula Between.The extraction of I-vector includes the following steps: to extract after the training language data process from certain target speaker To preset kind vocal print feature vector (for example, MFCC) be input to GMM-VAD model, obtain one and characterize this section of voice data The Gauss super vector of probability distribution in each Gaussian component；It is corresponding that this section of voice can be calculated using following formula Compared with the vocal print discriminant vectors I-vector:m of low dimensional_r=μ+T ω_r, whereinFor the Gauss super vector for representing this section of voice, μ For the mean value super vector of second model, T is by the I-vector of low dimensional, ω_rIt is mapped to high-dimensional Gaussian spatial The training of transition matrix, T uses EM algorithm.

Further, the step S4 of the present embodiment, comprising:

S40: obtained in the vocal print feature data of the multiple people prestored respectively it is corresponding prestore vocal print feature, In, vocal print feature is prestored including target person in the vocal print feature data of multiple people.

The present embodiment passes through the vocal print feature data for the more people including target person that will be prestored, while for judging currently to adopt Whether the vocal print feature of the voice signal of collection is identical as the vocal print feature of target person, to improve judgment accuracy.

S41: each similarity value prestored between vocal print feature and first vocal print feature is calculated separately.

The similarity value of the present embodiment characterizes the similarity prestored between vocal print feature and first vocal print feature, phase It is bigger like angle value, then it is both above-mentioned more similar.The acquisition methods of the similarity value of the present embodiment include prestoring vocal print by comparing Characteristic distance value between feature and first vocal print feature obtains, and features described above distance value includes COS distance value, European Distance value etc..

S42: each similarity value is ranked up according to sequence from big to small.

The present embodiment is by carrying out each similarity value prestored between vocal print feature and first vocal print feature It sorts from large to small, so as to the similarity distribution more accurately analyzed the first vocal print feature with respectively prestore vocal print feature, with Just the verifying to the first vocal print feature is more accurately obtained.

S43: in the similarity value of the preceding preset quantity of judgement sequence, if prestore vocal print spy including the target person Levy corresponding similarity value.

In similarity value of the present embodiment by the preceding preset quantity that sorts, vocal print spy is prestored including the target person Corresponding similarity value is levied, then determines that the first vocal print feature is identical as the vocal print feature of the target person prestored, to reduce model mistake The error rates such as poor bring identification, the error rates such as above-mentioned are that " the unsanctioned frequency of the verifying occurred when should be verified and should be verified not By when the frequency being verified that occurs it is equal ".The similarity value of the preset quantity of the present embodiment includes 1,2 or 3 Deng, can according to use demand carry out from set.

S44: if so, determine that first vocal print feature meets preset condition with the similarity for prestoring vocal print feature, it is no Then it is unsatisfactory for preset condition.

The application other embodiments prestore threshold at a distance from vocal print feature by setting the first vocal print feature and target user's Value, realizes effective voice print verification.Citing ground, preset threshold 0.6, if calculating prestoring for the first vocal print feature and target user The COS distance of vocal print feature is less than or equal to preset threshold, it is determined that the first vocal print feature and target user's prestores vocal print spy It levies identical, is then verified；If the COS distance for prestoring vocal print feature for calculating the first vocal print feature and target user is greater than in advance If threshold value, it is determined that the first vocal print feature and target user to prestore vocal print feature not identical, then authentication failed.

Further, the step S41 of the present embodiment, comprising:

S410: pass through COS distance formula respectivelyIt calculates and each described prestores vocal print feature and institute State the COS distance value between the first vocal print feature, wherein x representative respectively prestores vocal print discriminant vectors, and y represents the first vocal print feature Vocal print discriminant vectors.

The present embodiment passes through COS distance formulaVocal print feature and institute are prestored described in indicating each The similarity between the first vocal print feature is stated, wherein the distance value of COS distance is smaller, shows that two vocal print features are closer or phase Together.

S411: the COS distance value is converted into the similarity value, wherein the smallest COS distance value is corresponding Maximum similarity value.

The present embodiment can be by the inverse proportion formula by COS distance value according to the specified inverse ratio coefficient of carrying, by COS distance Value is converted into similarity value.

The present embodiment removes the voice number that noise data is purified by the noise data in recognition of speech signals According to, then according to purified voice data carry out Application on Voiceprint Recognition, improve the accuracy of voice print verification.The present embodiment passes through GMM- VAD model, in conjunction with local decisions and global decision, realization accurately distinguishes noise data and voice data, to improve purification The degree of voice signal further increases the accuracy of voice print verification.The present embodiment is based on GMM-UBM and realizes each vocal print Feature vector is each mapped to the vocal print discriminant vectors I-vector of low dimensional, reduces and is calculated as in vocal print feature extraction process This, reduces the use cost of voice print verification.The present embodiment is during voice print verification by comparing with the pre-stored data of more people Compared with analysis, reduce voice print verification etc. error rates, reduce the error of the model error bring voice print verification precision of voice print verification.

Referring to Fig. 2, a kind of device of voice print verification of one embodiment of the application, comprising:

Discriminating module 1 distinguishes the voice signal for the voice signal to voice print verification to be input in VAD model In speech frame and noise frame.

Module 2 is removed, for removing the noise frame, obtains the voice data of the purification of each speech frame composition.

Extraction module 3, corresponding first vocal print feature of voice data for extracting the purification.

Judgment module 4, for judging it is default whether first vocal print feature and the similarity for prestoring vocal print feature meet Condition.

Determination module 5, if determining that first vocal print feature prestores vocal print feature with described for meeting preset condition It is identical, it is otherwise not identical.

Further, in the VAD model of the present embodiment include Fourier transformation, Gaussian Mixture distribution GMM-NOISE and GMM-SPEECH, above-mentioned discriminating module 1, comprising:

Conversion unit believes the voice for the voice signal to be input in the Fourier transformation in VAD model Number it is changed into frequency-region signal form from time-domain signal form.

Discrimination unit, for each frame data of the voice signal of frequency-region signal form to be separately input to the GMM- VAD judgement is carried out in NOISE and GMM-SPEECH, to distinguish the speech frame and noise frame in voice signal.

Further, the discrimination unit of the present embodiment, comprising:

Subelement is inputted, for being separately input to GMM-NOISE for each frame data of the voice signal of frequency-region signal form In GMM-SPEECH, the noise frame probability of each frame data is respectively obtainedWith speech frame probability

First computation subunit is used for basis Calculate partial log likelihood ratio.

First judgment sub-unit, for judging whether the partial log likelihood ratio is higher than local gate limit value.

First determines subelement, if being higher than local gate limit value for partial log likelihood ratio, determines partial log likelihood Frame data than being higher than local gate limit value are speech frame.

Further, the discrimination unit of another embodiment of the application, comprising:

Second computation subunit, if being not higher than local gate limit value, basis for partial log likelihood ratioCalculate global log-likelihood ratio.

Second judgment sub-unit, for judging whether the global log-likelihood ratio is higher than global threshold.

Second determines subelement, if being higher than global threshold for global log-likelihood ratio, determines global log-likelihood Frame data than being higher than global threshold are speech frame.

Further, the extraction module 3 of the present embodiment, comprising:

Extraction unit, the corresponding MFCC type sound of each speech frame in the voice data for extracting the purification Line feature.

Construction unit, for constructing the corresponding vocal print of each speech frame according to each MFCC type vocal print feature Feature vector.

Map unit, for each vocal print feature vector to be each mapped to the vocal print discriminant vectors I- of low dimensional Vector, to obtain corresponding first vocal print feature of each speech frame in the voice data of the purification.

Further, the judgment module 4 of the present embodiment, comprising:

Acquiring unit corresponding prestores vocal print for obtaining in the vocal print feature data of the multiple people prestored respectively Feature, wherein prestore vocal print feature including target person in the vocal print feature data of multiple people.

Computing unit, for calculating separately each similarity prestored between vocal print feature and first vocal print feature Value.

Sequencing unit, for each similarity value to be ranked up according to sequence from big to small.

Judging unit, in the similarity value for the preceding preset quantity that judges to sort, if including the target person Prestore the corresponding similarity value of vocal print feature.

Judging unit, if for including that the target person prestores the corresponding similarity value of vocal print feature, determine described in First vocal print feature and the similarity for prestoring vocal print feature meet preset condition, are otherwise unsatisfactory for preset condition.

Further, the computing unit of the present embodiment, comprising:

Third computation subunit, for passing through COS distance formula respectivelyIt calculates each described Prestore the COS distance value between vocal print feature and first vocal print feature, wherein x representative respectively prestores vocal print discriminant vectors, y Represent the vocal print discriminant vectors of the first vocal print feature.

Conversion subunit, for the COS distance value to be converted into the similarity value, wherein the smallest cosine Distance value corresponds to maximum similarity value.

Referring to Fig. 3, a kind of computer equipment is also provided in the embodiment of the present application, which can be server, Its internal structure can be as shown in Figure 3.The computer equipment includes processor, the memory, network connected by system bus Interface and database.Wherein, the processor of the Computer Design is for providing calculating and control ability.The computer equipment is deposited Reservoir includes non-volatile memory medium, built-in storage.The non-volatile memory medium is stored with operating system, computer program And database.The internal memory provides environment for the operation of operating system and computer program in non-volatile memory medium.It should The database of computer equipment is used to store all data that the process of voice print verification needs.The network interface of the computer equipment For being communicated with external end by network connection.The side of voice print verification is realized when the computer program is executed by processor Method.

The method that above-mentioned processor executes above-mentioned voice print verification, comprising: the voice signal to voice print verification is input to VAD In model, the speech frame and noise frame in the voice signal are distinguished；The noise frame is removed, each speech frame composition is obtained Purification voice data；Extract corresponding first vocal print feature of voice data of the purification；Judge that first vocal print is special Whether the similarity for levying and prestoring vocal print feature meets preset condition；If satisfied, then determine first vocal print feature with it is described It is identical to prestore vocal print feature, it is otherwise not identical.

Above-mentioned computer equipment by the noise data in recognition of speech signals, and removes what noise data was purified Then voice data carries out Application on Voiceprint Recognition according to purified voice data, improves the accuracy of voice print verification.Pass through GMM- VAD model, in conjunction with local decisions and global decision, realization accurately distinguishes noise data and voice data, to improve purification The degree of voice signal further increases the accuracy of voice print verification.It is realized based on GMM-UBM by each vocal print feature vector It is each mapped to the vocal print discriminant vectors I-vector of low dimensional, reduces the calculating cost in vocal print feature extraction process, is reduced The use cost of voice print verification.By being compared analysis with the pre-stored data of more people during voice print verification, vocal print is reduced Verifying etc. error rates, reduce the error of the model error bring voice print verification precision of voice print verification.

In one embodiment, in the VAD model include Fourier transformation, Gaussian Mixture be distributed GMM-NOISE and The voice signal is input in VAD model by GMM-SPEECH, above-mentioned processor, distinguish voice signal in each speech frame and The step of each noise frame, comprising: the voice signal is input in the Fourier transformation in VAD model, the voice is believed Number it is changed into frequency-region signal form from time-domain signal form；Each frame data difference of the voice signal of frequency-region signal form is defeated Enter into the GMM-NOISE and GMM-SPEECH and carry out VAD judgement, to distinguish the speech frame and noise frame in voice signal.

In one embodiment, above-mentioned processor inputs each frame data of the voice signal of frequency-region signal form respectively VAD judgement is carried out, into the GMM-NOISE and GMM-SPEECH to distinguish the step of speech frame and noise frame in voice signal Suddenly, comprising: each frame data of the voice signal of frequency-region signal form are separately input in GMM-NOISE and GMM-SPEECH, Respectively obtain the noise frame probability of each frame dataWith speech frame probabilityAccording toCalculate partial log likelihood Than；Judge whether the partial log likelihood ratio is higher than local gate limit value；If so, determining that the partial log likelihood ratio is higher than The frame data of local gate limit value are speech frame.

In one embodiment, above-mentioned processor judge the step of whether log-likelihood ratio is higher than local gate limit value it Afterwards, comprising: if partial log likelihood ratio is not higher than local gate limit value, basisCalculate global log-likelihood ratio； Judge whether the global log-likelihood ratio is higher than global threshold；If global log-likelihood ratio is higher than global threshold, sentence The frame data that the fixed global log-likelihood ratio is higher than global threshold are speech frame.

In one embodiment, above-mentioned processor extracts the step of corresponding first vocal print feature of voice data of the purification Suddenly, comprising: extract the corresponding MFCC type vocal print feature of each speech frame in the voice data of the purification；According to each The MFCC type vocal print feature constructs the corresponding vocal print feature vector of each speech frame；By each vocal print feature to Amount is each mapped to the vocal print discriminant vectors I-vector of low dimensional, to obtain each voice in the voice data of the purification Corresponding first vocal print feature of frame.

In one embodiment, above-mentioned processor judges that first vocal print feature is with the similarity for prestoring vocal print feature No the step of meeting preset condition, comprising: obtained in the vocal print feature data of the multiple people prestored respectively corresponding pre- Deposit vocal print feature, wherein prestore vocal print feature including target person in the vocal print feature data of multiple people；It calculates separately each described Prestore the similarity value between vocal print feature and first vocal print feature；Each similarity value is suitable according to from big to small Sequence is ranked up；In the similarity value of the preceding preset quantity of judgement sequence, if prestore vocal print spy including the target person Levy corresponding similarity value；If so, determining that first vocal print feature meets default item with the similarity for prestoring vocal print feature Otherwise part is unsatisfactory for preset condition.

In one embodiment, above-mentioned processor, which calculates separately, each described prestores vocal print feature and first vocal print feature Between similarity value the step of, comprising: respectively pass through COS distance formulaIt calculates each described pre- Deposit the COS distance value between vocal print feature and first vocal print feature, wherein x representative respectively prestores vocal print discriminant vectors, y generation The vocal print discriminant vectors of the first vocal print feature of table；The COS distance value is converted into the similarity value, wherein the smallest institute It states COS distance value and corresponds to maximum similarity value.

It will be understood by those skilled in the art that structure shown in Fig. 3, only part relevant to application scheme is tied The block diagram of structure does not constitute the restriction for the computer equipment being applied thereon to application scheme.

One embodiment of the application also provides a kind of computer readable storage medium, is stored thereon with computer program, calculates The method of voice print verification is realized when machine program is executed by processor, comprising: the voice signal to voice print verification is input to VAD mould In type, the speech frame and noise frame in the voice signal are distinguished；The noise frame is removed, each speech frame composition is obtained The voice data of purification；Extract corresponding first vocal print feature of voice data of the purification；Judge first vocal print feature Whether meet preset condition with the similarity for prestoring vocal print feature；If satisfied, then determine first vocal print feature with it is described pre- It is identical to deposit vocal print feature, it is otherwise not identical.

Above-mentioned computer readable storage medium by the noise data in recognition of speech signals, and removes noise data and obtains To the voice data of purification, Application on Voiceprint Recognition then is carried out according to purified voice data, improves the accuracy of voice print verification.It is logical GMM-VAD model is crossed, in conjunction with local decisions and global decision, realization accurately distinguishes noise data and voice data, to mention The degree of high clean speech signal further increases the accuracy of voice print verification.It is realized based on GMM-UBM each vocal print is special Sign vector is each mapped to the vocal print discriminant vectors I-vector of low dimensional, reduces and is calculated as in vocal print feature extraction process This, reduces the use cost of voice print verification.By being compared analysis with the pre-stored data of more people during voice print verification, drop Low voice print verification etc. error rates, reduce the error of the model error bring voice print verification precision of voice print verification.

Those of ordinary skill in the art will appreciate that realizing all or part of the process in above-described embodiment method, being can be with Relevant hardware is instructed to complete by computer program, above-mentioned computer program can be stored in a non-volatile computer In read/write memory medium, the computer program is when being executed, it may include such as the process of the embodiment of above-mentioned each method.Wherein, Any reference used in provided herein and embodiment to memory, storage, database or other media, Including non-volatile and/or volatile memory.Nonvolatile memory may include read-only memory (ROM), programming ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM) or flash memory.Volatile memory may include Random access memory (RAM) or external cache.By way of illustration and not limitation, RAM is available in many forms, Such as static state RAM (SRAM), dynamic ram (DRAM), synchronous dram (SDRAM), double speed are according to rate SDRAM (SSRSDRAM), enhancing Type SDRAM (ESDRAM), synchronization link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic ram (DRDRAM) and memory bus dynamic ram (RDRAM) etc..

It should be noted that, in this document, the terms "include", "comprise" or its any other variant are intended to non-row His property includes, so that the process, device, article or the method that include a series of elements not only include those elements, and And further include other elements that are not explicitly listed, or further include for this process, device, article or method institute it is intrinsic Element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that including being somebody's turn to do There is also other identical elements in the process, device of element, article or method.

The foregoing is merely preferred embodiment of the present application, are not intended to limit the scope of the patents of the application, all utilizations Equivalent structure or equivalent flow shift made by present specification and accompanying drawing content is applied directly or indirectly in other correlations Technical field, similarly include in the scope of patent protection of the application.

Claims

1. a kind of method of voice print verification characterized by comprising

Voice signal to voice print verification is input in VAD model, the speech frame and noise frame in the voice signal are distinguished；

2. the method for voice print verification according to claim 1, which is characterized in that include that Fourier becomes in the VAD model It changes, the GMM-NOISE and GMM-SPEECH of Gaussian Mixture distribution, it is described that the voice signal is input in VAD model, it distinguishes The step of speech frame and noise frame in voice signal, comprising:

The voice signal is input in the Fourier transformation in VAD model, by the voice signal from time-domain signal form It is changed into frequency-region signal form；

Each frame data of the voice signal of frequency-region signal form are separately input in the GMM-NOISE and GMM-SPEECH VAD judgement is carried out, to distinguish the speech frame and noise frame in voice signal.

3. the method for voice print verification according to claim 2, which is characterized in that the voice by frequency-region signal form is believed Number each frame data be separately input in the GMM-NOISE and GMM-SPEECH carry out VAD judgement, to distinguish voice signal In speech frame and the step of noise frame, comprising:

According toIt is right to calculate part Number likelihood ratio；

4. the method for voice print verification according to claim 3, which is characterized in that described whether to judge the log-likelihood ratio After the step of higher than local gate limit value, comprising:

If global log-likelihood ratio is higher than global threshold, determine that the global log-likelihood ratio is higher than the frame of global threshold Data are speech frame.

5. the method for voice print verification according to claim 1, which is characterized in that the voice data for extracting the purification The step of corresponding first vocal print feature, comprising:

Each vocal print feature vector is each mapped to the vocal print discriminant vectors I-vector of low dimensional, to obtain the purification Voice data in corresponding first vocal print feature of each speech frame.

6. the method for voice print verification according to claim 5, which is characterized in that the judgement first vocal print feature with The step of whether similarity of vocal print feature meets preset condition prestored, comprising:

It is obtained in the vocal print feature data of the multiple people prestored respectively and corresponding prestores vocal print feature, wherein Duo Geren Vocal print feature data in include that target person prestores vocal print feature；

Each similarity value is ranked up according to sequence from big to small；

Judgement sort preceding preset quantity similarity value in, if including the target person to prestore vocal print feature corresponding Similarity value；

If so, determining that first vocal print feature meets preset condition with the similarity for prestoring vocal print feature, otherwise it is unsatisfactory for Preset condition.

7. the method for voice print verification according to claim 6, which is characterized in that described calculate separately each described prestores vocal print The step of similarity value between feature and first vocal print feature, comprising:

Pass through COS distance formula respectivelyIt calculates and each described prestores vocal print feature and first sound COS distance value between line feature, wherein x representative respectively prestores vocal print discriminant vectors, and y represents the vocal print mirror of the first vocal print feature Other vector；

The COS distance value is converted into the similarity value, wherein the smallest COS distance value corresponds to maximum phase Like angle value.

8. a kind of device of voice print verification characterized by comprising

Discriminating module distinguishes the language in the voice signal for the voice signal to voice print verification to be input in VAD model Sound frame and noise frame；

Judgment module, for judging whether first vocal print feature meets preset condition with the similarity for prestoring vocal print feature；

Determination module, if determining that first vocal print feature is identical as the vocal print feature that prestores for meeting preset condition, Otherwise not identical.

9. a kind of computer equipment, including memory and processor, the memory are stored with computer program, feature exists In the step of processor realizes any one of claims 1 to 7 the method when executing the computer program.

10. a kind of computer readable storage medium, is stored thereon with computer program, which is characterized in that the computer program The step of method described in any one of claims 1 to 7 is realized when being executed by processor.