CN109378002A - Voiceprint verification method, apparatus, computer device and storage medium - Google Patents

Voiceprint verification method, apparatus, computer device and storage medium

Info

Publication number
CN109378002A
CN109378002A (application number CN201811184693.9A)
Authority
CN
China
Prior art keywords
voiceprint
voice
voiceprint feature
frame
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811184693.9A
Other languages
Chinese (zh)
Other versions
CN109378002B (en)
Inventor
杨翘楚
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811184693.9A priority Critical patent/CN109378002B/en
Priority to PCT/CN2018/124401 priority patent/WO2020073518A1/en
Publication of CN109378002A publication Critical patent/CN109378002A/en
Application granted granted Critical
Publication of CN109378002B publication Critical patent/CN109378002B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application discloses a voiceprint verification method, apparatus, computer device and storage medium. The voiceprint verification method comprises: inputting a speech signal to be verified into a VAD model to distinguish speech frames from noise frames in the speech signal; removing the noise frames to obtain purified speech data composed of the speech frames; extracting a first voiceprint feature corresponding to the purified speech data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition; and, if so, determining that the first voiceprint feature matches the pre-stored voiceprint feature, otherwise determining that it does not. By identifying and removing the noise data in the speech signal to obtain purified speech data, and then performing voiceprint recognition on the purified data, the application improves the accuracy of voiceprint verification.

Description

Voiceprint verification method, apparatus, computer device and storage medium
Technical field
This application relates to the field of voiceprint verification, and in particular to a voiceprint verification method, apparatus, computer device and storage medium.
Background technique
At present, the business scope of many large financial corporations covers insurance, banking, investment and other lines of business, each of which usually needs to communicate with the same client and to carry out anti-fraud screening. Verifying client identity and detecting fraud have therefore become important components of guaranteeing service security. In the identity verification step, voiceprint verification is adopted by many companies because of its real-time operation and convenience. In practice, owing to the environment in which the speaker performs identity registration or identity verification, the collected speech data often carries background noise that does not come from the speaker, and this has become one of the main factors lowering the voiceprint verification success rate.
Summary of the invention
The main purpose of the application is to provide a voiceprint verification method, intended to solve the technical problem that noise in existing speech data adversely affects the voiceprint verification result.
The application proposes a voiceprint verification method, comprising:
inputting a speech signal to be verified into a VAD model to distinguish the speech frames and noise frames in the speech signal;
removing the noise frames to obtain purified speech data composed of the speech frames;
extracting a first voiceprint feature corresponding to the purified speech data;
judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition;
if so, determining that the first voiceprint feature matches the pre-stored voiceprint feature; otherwise, determining that it does not.
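The five steps of the claimed method can be sketched end to end. Every helper below (`is_speech`, `extract_feature`, `similar`) is a hypothetical stand-in for the VAD model, the feature extractor and the similarity test that later embodiments describe, not an API defined by the patent:

```python
# Minimal sketch of the five-step verification flow (hypothetical helpers).

def verify(frames, enrolled_feature, is_speech, extract_feature, similar):
    speech = [f for f in frames if is_speech(f)]   # S1+S2: keep only speech frames
    feature = extract_feature(speech)              # S3: first voiceprint feature
    return similar(feature, enrolled_feature)      # S4+S5: similarity decision

# Toy stand-ins: a "frame" is (label, value); the feature is the mean value.
frames = [("s", 1.0), ("n", 9.0), ("s", 3.0)]
ok = verify(frames, 2.1,
            is_speech=lambda f: f[0] == "s",
            extract_feature=lambda fs: sum(v for _, v in fs) / len(fs),
            similar=lambda a, b: abs(a - b) < 0.5)
print(ok)  # mean of the speech values is 2.0, within 0.5 of 2.1 -> True
```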
Preferably, the VAD model includes a Fourier transform and two Gaussian mixture models, GMM-NOISE and GMM-SPEECH, and the step of inputting the speech signal into the VAD model and distinguishing the speech frames and noise frames in the speech signal comprises:
inputting the speech signal into the Fourier transform of the VAD model to convert the speech signal from time-domain form into frequency-domain form;
inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH respectively for VAD decision, so as to distinguish the speech frames and noise frames in the speech signal.
Preferably, the step of inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH respectively for VAD decision, so as to distinguish the speech frames and noise frames in the speech signal, comprises:
inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH respectively, obtaining for each frequency band k the noise-frame likelihood p_noise(x_k) and the speech-frame likelihood p_speech(x_k) of the frame data;
calculating the local log-likelihood ratio Λ_k = log(p_speech(x_k) / p_noise(x_k));
judging whether the local log-likelihood ratio is higher than a local threshold value;
if so, determining the frame data whose local log-likelihood ratio is higher than the local threshold to be a speech frame.
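A minimal sketch of the per-band local decision, assuming one-dimensional two-component GMMs; the component parameters and the threshold are illustrative, not the WebRTC defaults the patent refers to:

```python
import math

def gmm_pdf(x, comps):
    """Likelihood of x under a 1-D GMM given as [(weight, mean, std), ...]."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
               for w, m, s in comps)

def local_decision(x, gmm_speech, gmm_noise, local_threshold):
    """Local log-likelihood-ratio test for one frequency band's energy x."""
    llr = math.log(gmm_pdf(x, gmm_speech)) - math.log(gmm_pdf(x, gmm_noise))
    return llr, llr > local_threshold

speech_gmm = [(0.5, 5.0, 1.0), (0.5, 7.0, 1.0)]  # two Gaussian components each
noise_gmm = [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)]
llr, is_speech = local_decision(5.5, speech_gmm, noise_gmm, local_threshold=0.0)
print(is_speech)  # band energy near the speech means -> True
```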
Preferably, after the step of judging whether the local log-likelihood ratio is higher than the local threshold value, the method comprises:
if the local log-likelihood ratio is not higher than the local threshold, calculating the global log-likelihood ratio Λ_global = Σ_k w_k Λ_k, where w_k is the weight of frequency band k;
judging whether the global log-likelihood ratio is higher than a global threshold;
if the global log-likelihood ratio is higher than the global threshold, determining the frame data whose global log-likelihood ratio is higher than the global threshold to be a speech frame.
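The global decision can be sketched as a weighted sum of the per-band local log-likelihood ratios over the 6 bands of the embodiment; the weights and the threshold below are illustrative assumptions, not the WebRTC constants:

```python
# Global VAD decision: weighted sum of per-band local log-likelihood ratios.

def global_decision(local_llrs, weights, global_threshold):
    g = sum(w * llr for w, llr in zip(weights, local_llrs))
    return g, g > global_threshold

local_llrs = [-0.2, 0.1, 0.4, 0.3, -0.1, 0.2]  # one LLR per frequency band
weights = [1.0] * 6                             # assumed uniform weights
g, is_speech = global_decision(local_llrs, weights, global_threshold=0.5)
print(round(g, 2), is_speech)  # 0.7 True
```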
Preferably, the step of extracting the first voiceprint feature corresponding to the purified speech data comprises:
extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified speech data;
constructing the voiceprint feature vector corresponding to each speech frame from each MFCC-type voiceprint feature;
mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), to obtain the first voiceprint feature corresponding to each speech frame in the purified speech data.
Preferably, the step of judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition comprises:
obtaining the corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of a target person;
calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature;
sorting the similarity values in descending order;
judging whether the top preset number of similarity values includes the similarity value corresponding to the pre-stored voiceprint feature of the target person;
if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition; otherwise, determining that it does not.
Preferably, the step of calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature comprises:
calculating the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature by the cosine distance formula d(x, y) = 1 − (x · y) / (|x| |y|), where x represents each pre-stored voiceprint discriminant vector and y represents the voiceprint discriminant vector of the first voiceprint feature;
converting each cosine distance value into the similarity value, where the smallest cosine distance value corresponds to the largest similarity value.
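A sketch of the cosine-distance comparison, under the assumption that the distance is one minus the cosine of the angle between the two i-vectors, so that the smallest distance maps to the largest similarity:

```python
import math

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)        # 0 = identical direction

def similarity(x, y):
    return 1.0 - cosine_distance(x, y)  # smaller distance -> larger similarity

a = [1.0, 0.0]
print(round(similarity(a, [2.0, 0.0]), 3))  # 1.0: parallel i-vectors
print(round(similarity(a, [0.0, 3.0]), 3))  # 0.0: orthogonal i-vectors
```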
The application also provides a voiceprint verification apparatus, comprising:
a distinguishing module, configured to input the speech signal to be verified into a VAD model and distinguish the speech frames and noise frames in the speech signal;
a removing module, configured to remove the noise frames and obtain the purified speech data composed of the speech frames;
an extracting module, configured to extract the first voiceprint feature corresponding to the purified speech data;
a judging module, configured to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition;
a determining module, configured to determine, if the preset condition is satisfied, that the first voiceprint feature matches the pre-stored voiceprint feature, and otherwise that it does not.
The application also provides a computer device, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above method when executing the computer program.
The application also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method when executed by a processor.
By identifying the noise data in the speech signal, removing it to obtain purified speech data, and then performing voiceprint recognition on the purified speech data, the application improves the accuracy of voiceprint verification. Through the GMM-based VAD model, combining local decisions with a global decision, noise data and speech data are distinguished accurately, which raises the purity of the speech signal and further improves verification accuracy. Based on GMM-UBM, each voiceprint feature vector is mapped to a low-dimensional voiceprint discriminant vector (i-vector), which reduces the computational cost of voiceprint feature extraction and thus the cost of voiceprint verification. By comparing against the pre-stored data of multiple people during verification, the application lowers the equal error rate and reduces the loss of verification precision caused by model error.
Detailed description of the invention
Fig. 1 is a schematic flowchart of the voiceprint verification method according to an embodiment of the application;
Fig. 2 is a schematic structural diagram of the voiceprint verification apparatus according to an embodiment of the application;
Fig. 3 is a schematic diagram of the internal structure of the computer device according to an embodiment of the application.
Specific embodiment
It should be understood that the specific embodiments described here only serve to explain the application and are not intended to limit it.
Referring to Fig. 1, a voiceprint verification method according to an embodiment of the application comprises:
S1: inputting the speech signal to be verified into a VAD model, and distinguishing the speech frames and noise frames in the speech signal.
The VAD model of this embodiment, also known as a voice activity (endpoint) detector, detects whether speech data is present in a noisy environment. The VAD model scores each frame of the input speech signal, i.e. estimates the probability that the frame is a speech frame or a noise frame; when the speech-frame probability exceeds a preset decision threshold, the frame is determined to be a speech frame, otherwise a noise frame. According to this decision result, the VAD model distinguishes speech frames from noise frames so that the noise frames can be removed from the speech signal. The decision threshold of this embodiment uses the default threshold in the WebRTC source code, which was obtained during the development of WebRTC by analysing a large amount of data; this improves the accuracy of the distinction while avoiding extra training of the VAD model.
S2: removing the noise frames to obtain the purified speech data composed of the speech frames.
According to the above distinction result, this embodiment cuts out the data labelled as noise frames and arranges the remaining speech frames consecutively in their original temporal order, forming the purified speech data composed of the speech frames. In other embodiments of the application, the data labelled as speech frames may instead be extracted and saved, and the saved speech frames arranged consecutively in their original temporal order to form the purified speech data. By discarding the background noise that does not come from the speaker in the environment of identity registration or identity verification, this embodiment reduces the influence of noise data on the voiceprint verification result and thereby improves the verification success rate.
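Step S2 amounts to filtering the frame sequence while preserving temporal order. A minimal sketch, assuming the frames arrive already labelled by the VAD:

```python
# Drop frames labelled as noise; keep the speech frames in their original
# temporal order to form the purified speech data.

def purify(frames):
    """frames: list of (is_speech, samples) tuples in temporal order."""
    return [samples for is_speech, samples in frames if is_speech]

frames = [(True, [1, 2]), (False, [9, 9]), (True, [3, 4])]
purified = purify(frames)
print(purified)  # [[1, 2], [3, 4]]
```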
S3: extracting the first voiceprint feature corresponding to the purified speech data.
By analysing only the first voiceprint feature of the purified speech data, this embodiment reduces the amount of computation in voiceprint verification while improving its validity, specificity and timeliness.
S4: judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies a preset condition.
The preset condition of this embodiment includes a specified threshold range, a specified ranking, or the like; it can be customised according to the specific application scenario to meet individual usage needs.
S5: if so, determining that the first voiceprint feature matches the pre-stored voiceprint feature; otherwise, determining that it does not.
If this embodiment determines that the first voiceprint feature matches the pre-stored voiceprint feature, a verification-passed result is fed back to the client; otherwise, a verification-failed result is fed back, so that the client can take further action according to the result. For example, an intelligent door lock is opened after verification passes; or, after a preset number of failed verifications, a security system locks the screen to prevent a criminal from doing further damage to an e-banking system.
Further, the VAD model of this embodiment includes a Fourier transform and the Gaussian mixture models GMM-NOISE and GMM-SPEECH, and step S1 comprises:
S100: inputting the speech signal into the Fourier transform of the VAD model to convert the speech signal from time-domain form into frequency-domain form.
Accordingly, the Fourier transform in the VAD model converts the time-domain signal into frequency-domain form, which facilitates analysing the properties of each frame of the speech signal and distinguishing speech frames from noise frames.
S101: inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH respectively for VAD decision, so as to distinguish the speech frames and noise frames in the speech signal.
This embodiment preferably uses a VAD model based on Gaussian mixtures (GMM). For each input frame in frequency-domain form, the energy is extracted in 6 frequency bands as the feature vector of that frame, and noise and speech are each modelled by a Gaussian mixture distribution in the 6 bands, so that each band has a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components. The 6 bands follow the WebRTC configuration, which is based on the spectral difference between noise and speech, improving analysis accuracy and compatibility with WebRTC; other embodiments of the application need not use exactly 6 bands and may set the number as required. Because the Chinese mains standard is 220 V at 50 Hz, 50 Hz interference from the power supply and physical vibration can leak into the microphone collecting the speech signal; this embodiment therefore preferably collects the signal above 80 Hz to reduce mains interference, and since the highest frequency reached by speech is about 4 kHz, the band boundaries are preferably placed at spectral troughs within the 80 Hz to 4 kHz range. The VAD decision of this embodiment includes a local decision and a global decision.
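The 6-band energy feature can be sketched as follows. The band edges below are assumed for illustration only, since the embodiment fixes just the overall 80 Hz to 4 kHz range and places the real boundaries at spectral troughs:

```python
import math

BAND_EDGES = [80, 250, 500, 1000, 2000, 3000, 4000]  # 6 bands (assumed split)

def band_energies(power_spectrum):
    """power_spectrum: list of (freq_hz, power) pairs for one frame.
    Returns the log-energy in each of the 6 bands as the frame's feature."""
    feats = []
    for lo, hi in zip(BAND_EDGES, BAND_EDGES[1:]):
        e = sum(p for f, p in power_spectrum if lo <= f < hi)
        feats.append(math.log(e + 1e-10))  # small floor avoids log(0)
    return feats

spec = [(100, 2.0), (300, 1.0), (1500, 4.0)]
feats = band_energies(spec)
print(len(feats))  # 6
```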
Further, step S101 of this embodiment comprises:
S1010: inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH respectively, obtaining for each frequency band k the noise-frame likelihood p_noise(x_k) and the speech-frame likelihood p_speech(x_k) of the frame.
Each frame of the speech signal to be analysed as speech or noise is input into GMM-NOISE and GMM-SPEECH respectively, so that the noise-frame probability value and the speech-frame probability value of the frame are obtained; by comparing the two, the frame can be determined to be a noise frame or a speech frame.
S1011: calculating the local log-likelihood ratio Λ_k = log(p_speech(x_k) / p_noise(x_k)) in each frequency band.
Since this embodiment extracts the energy of each input frame in 6 frequency bands as its feature vector, n = 6 in this embodiment; when judging each frame, 6 local decisions are carried out, i.e. one local decision per band, and as long as one of them considers the frame to be a speech frame, the frame is retained.
S1012: judging whether the local log-likelihood ratio is higher than the local threshold value.
This embodiment distinguishes speech frames from noise frames through the local decision, performed once in each band, 6 times in total. The likelihood ratio is an index of authenticity, a composite measure that reflects sensitivity and specificity at the same time and improves the accuracy of the probability estimate; on the basis of the speech-frame probability exceeding the noise-frame probability, further requiring the local log-likelihood ratio to exceed the local threshold ensures the accuracy of determining the frame to be speech.
S1013: if so, determining the frame whose local log-likelihood ratio is higher than the local threshold to be a speech frame.
The GMM parameters of this embodiment are updated adaptively: after each frame is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to the feature values of that frame. For example, if the frame is judged to be a speech frame, the means, standard deviations and Gaussian component weights of GMM-SPEECH are updated once according to the frame's feature values; as more and more speech frames are input, GMM-SPEECH adapts increasingly well to the voiceprint characteristics of the speaker in this call, and its analysis becomes more accurate.
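The adaptive update can be illustrated with a simple exponential moving average of one component mean; the text does not specify the actual update rule, so this conveys only the general idea of the model drifting toward the current speaker's statistics:

```python
# When a frame is judged to be speech, nudge a GMM-SPEECH parameter toward
# that frame's feature value (illustrative exponential moving average).

def adapt_mean(mean, feature, rate=0.1):
    return (1 - rate) * mean + rate * feature

mean = 5.0
for feature in [6.0, 6.0, 6.0]:  # three speech frames with feature near 6.0
    mean = adapt_mean(mean, feature)
print(round(mean, 3))  # 5.271: the mean drifts toward the speaker's value
```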
Further, after step S1012 in another embodiment of the application, the method comprises:
S1014: if the local log-likelihood ratio is not higher than the local threshold, calculating the global log-likelihood ratio Λ_global = Σ_k w_k Λ_k, where w_k is the weight of band k.
This embodiment first carries out the local decisions and then the global decision; the global decision computes a weighted sum over the bands on the basis of the local decision results, improving the accuracy of distinguishing speech frames from noise frames.
S1015: judging whether the global log-likelihood ratio is higher than the global threshold.
Comparing the global log-likelihood ratio with the global threshold in the global decision further improves the accuracy of screening speech frames.
S1016: if the global log-likelihood ratio is higher than the global threshold, determining the frame whose global log-likelihood ratio is higher than the global threshold to be a speech frame.
This embodiment may skip the global decision when the local decision already indicates the presence of speech, improving the efficiency of voiceprint verification while recognising as many speech frames as possible and avoiding speech distortion. Other embodiments of the application may still carry out the global decision when the local decision indicates speech, in order to further verify and confirm the presence of speech and improve the accuracy of distinguishing speech frames from noise frames.
Further, step S3 of this embodiment comprises:
S30: extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified speech data.
The process of extracting the MFCC (Mel Frequency Cepstrum Coefficient) voiceprint feature in this embodiment is as follows. First, sampling and quantisation: the continuous analogue speech signal of the purified speech data is sampled at a certain sampling period and converted into a discrete signal, and the discrete signal is quantised into a digital signal according to a certain coding rule. Then pre-emphasis: owing to the physiology of the human body, the high-frequency components of the speech signal are attenuated, and pre-emphasis compensates for them. Then framing: because of the short-time stationarity of speech, spectrum analysis is carried out on frames of the signal (generally 10 to 30 ms per frame), and features are then extracted frame by frame. Then windowing: to reduce the discontinuities induced at the start and end of each frame, a Hamming window is applied. Then a DFT is applied to each frame to transform it from the time domain to the frequency domain, and the signal is mapped from the linear spectral domain to the Mel spectral domain by the formula mel(f) = 2595 · log10(1 + f / 700). The transformed frame is input to a bank of Mel triangular filters, and the logarithmic energy of each filter's output is calculated, yielding a log-energy sequence. Finally, applying the discrete cosine transform (DCT) to the log-energy sequence yields the MFCC-type voiceprint feature of the frame.
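Two of the numeric ingredients in this pipeline, the standard linear-to-Mel mapping and the Hamming window, can be checked directly:

```python
import math

def hz_to_mel(f):
    """Standard linear-frequency to Mel-frequency mapping."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hamming(n_samples):
    """Hamming window coefficients for one frame of n_samples points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]

print(round(hz_to_mel(0), 1))    # 0.0
print(round(hz_to_mel(700), 1))  # 781.2  (= 2595 * log10(2))
w = hamming(5)
print(round(w[0], 2), round(w[2], 2))  # 0.08 1.0: tapered edges, full centre
```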
S31: constructing the voiceprint feature vector corresponding to each speech frame from each MFCC-type voiceprint feature.
MFCC-type voiceprint features have a nonlinear characteristic, so the analysis result in each band is closer to the nature of real human speech; the extracted voiceprint features are therefore more accurate, improving the effect of voiceprint verification.
S32: mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), to obtain the first voiceprint feature corresponding to each speech frame in the purified speech data.
This embodiment uses GMM-UBM (Gaussian Mixture Model - Universal Background Model) to map each voiceprint feature vector to a low-dimensional i-vector, reducing the computational cost of voiceprint feature extraction and therefore the cost of voiceprint verification. The GMM-UBM of this embodiment is trained as follows. B1: obtain a preset number (for example, 100,000) of speech data samples, each corresponding to one voiceprint discriminant vector; the samples may be speech collected from different people in different environments, so that they can train a universal background model (GMM-UBM) characterising general speech properties. B2: process each speech data sample to extract its voiceprint feature of the preset type, and construct the sample's voiceprint feature vector from it. B3: divide all constructed feature vectors of the preset type into a training set of a first percentage and a validation set of a second percentage, the two percentages summing to at most 100%. B4: train the model on the feature vectors of the training set, and after training verify the trained model's accuracy on the validation set. B5: if the accuracy exceeds a preset value (for example, 98.5%), training ends; otherwise, increase the number of speech data samples and repeat steps B2 to B5 on the enlarged sample set.
The voiceprint discriminant vector of this embodiment is expressed as an i-vector. Compared with the dimension of the Gaussian space, the i-vector dimension is low, which reduces computational cost; the low-dimensional i-vector is related to the Gaussian space through a transition matrix that maps the low-dimensional vector into the high-dimensional space. Extraction of the i-vector includes the following steps: the voiceprint feature vectors of the preset type (for example, MFCC) extracted from a target speaker's training speech are input into the GMM model, yielding a Gaussian supervector characterising the probability distribution of this speech in each Gaussian component; the low-dimensional voiceprint discriminant vector (i-vector) of this speech is then obtained from the relation m_r = μ + T ω_r, where m_r is the Gaussian supervector representing this speech, μ is the mean supervector of the universal background model, ω_r is the low-dimensional i-vector, and T is the transition matrix mapping the low-dimensional i-vector into the high-dimensional Gaussian space; T is trained with the EM algorithm.
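The linear shape of the supervector relation m_r = μ + T ω_r can be illustrated on toy dimensions. Real i-vector systems recover ω_r from Baum-Welch statistics with an EM-trained T, so the least-squares recovery below is only a sketch of the underlying algebra:

```python
# Toy dimensions: supervector length 3, i-vector length 1.

def mat_vec(T, w):
    return [sum(T[i][j] * w[j] for j in range(len(w))) for i in range(len(T))]

mu = [1.0, 2.0, 3.0]          # UBM mean supervector
T = [[2.0], [0.0], [1.0]]     # transition matrix (3 x 1)
w_true = [0.5]                # low-dimensional i-vector
m = [mu_i + d for mu_i, d in zip(mu, mat_vec(T, w_true))]  # m = mu + T w

# Least-squares recovery of the 1-D i-vector: w = T'.(m - mu) / T'.T
r = [m_i - mu_i for m_i, mu_i in zip(m, mu)]
num = sum(T[i][0] * r[i] for i in range(3))
den = sum(T[i][0] ** 2 for i in range(3))
print(num / den)  # 0.5: the low-dimensional vector is recovered
```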
Further, step S4 of this embodiment comprises:
S40: obtaining the corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of the target person.
By judging the voiceprint feature of the currently collected speech signal against the pre-stored voiceprint feature data of multiple people including the target person, this embodiment improves judgment accuracy.
S41: calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature.
The similarity value of this embodiment characterises how similar a pre-stored voiceprint feature and the first voiceprint feature are; the larger the value, the more similar the two. The similarity value is obtained by comparing a feature distance between the pre-stored voiceprint feature and the first voiceprint feature, such as the cosine distance or the Euclidean distance.
S42: each similarity value is ranked up according to sequence from big to small.
The present embodiment is by carrying out each similarity value prestored between vocal print feature and first vocal print feature It sorts from large to small, so as to the similarity distribution more accurately analyzed the first vocal print feature with respectively prestore vocal print feature, with Just the verifying to the first vocal print feature is more accurately obtained.
S43: judging whether the similarity value corresponding to the pre-stored voiceprint feature of the target person is among the top preset number of sorted similarity values.
If the similarity value corresponding to the target person's pre-stored voiceprint feature ranks within the top preset number, the first voiceprint feature is determined to be identical to the pre-stored voiceprint feature of the target person. This reduces recognition errors caused by model error, such as the equal error rate, i.e. the rate at which the frequency of rejecting a verification that should pass equals the frequency of passing a verification that should fail. The preset number of the present embodiment may be 1, 2, 3, or the like, and can be set freely according to usage requirements.
S44: if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition; otherwise the preset condition is not met.
In other embodiments of the application, effective voiceprint verification is realized by setting a threshold on the distance between the first voiceprint feature and the target user's pre-stored voiceprint feature. For example, with a preset threshold of 0.6, if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the preset threshold, the two are determined to be identical and verification passes; if the cosine distance is greater than the preset threshold, the two are determined to be different and verification fails.
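The top-N decision of steps S40 to S44 can be sketched minimally as follows; the person identifiers, similarity scores, and the preset number n are illustrative assumptions, not values from the embodiment.

```python
def verify_top_n(similarities, target_id, n=3):
    """Return True if the target's pre-stored voiceprint ranks among the
    top-n most similar entries (the S42/S43 decision, sketched)."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:n]

# Hypothetical similarity values between the first voiceprint feature and
# each pre-stored voiceprint.
sims = {"target": 0.91, "user_a": 0.88, "user_b": 0.95, "user_c": 0.40}
passed = verify_top_n(sims, "target", n=3)   # "target" ranks 2nd of 4
```

With n = 1 the check degenerates to requiring the target to be the single best match, which trades a lower false-acceptance rate for a higher false-rejection rate.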
Further, step S41 of the present embodiment comprises:
S410: computing the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature with the cosine formula cos(x, y) = (x · y) / (‖x‖ ‖y‖), where x denotes a pre-stored voiceprint discriminant vector and y denotes the voiceprint discriminant vector of the first voiceprint feature.
The cosine formula expresses the similarity between each pre-stored voiceprint feature and the first voiceprint feature: the smaller the cosine distance, the closer, or more nearly identical, the two voiceprint features are.
S411: converting the cosine distance values into the similarity values, where the smallest cosine distance value corresponds to the largest similarity value.
The present embodiment may convert a cosine distance value into a similarity value through an inverse-proportion formula with a specified inverse-proportion coefficient.
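One way the inverse-proportion conversion of S411 might look is sketched below; the embodiment does not specify the formula, so both the mapping k/(1+d) and the coefficient k are assumptions chosen only to satisfy the stated property that the smallest distance yields the largest similarity.

```python
def distance_to_similarity(distance, k=1.0):
    """Inverse-proportion mapping: smaller cosine distance gives larger
    similarity (k is an assumed proportionality coefficient)."""
    return k / (1.0 + distance)

d_small, d_large = 0.1, 0.9
s_small = distance_to_similarity(d_small)
s_large = distance_to_similarity(d_large)
```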
The present embodiment identifies the noise data in the voice signal and removes it to obtain purified voice data, then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-VAD model, combining local decisions with a global decision, noise data and voice data are accurately distinguished, raising the degree of purification of the voice signal and further improving the accuracy of voiceprint verification. Based on GMM-UBM, each voiceprint feature vector is mapped to a low-dimensional voiceprint discriminant vector (i-vector), which reduces the computation cost of voiceprint feature extraction and thus the cost of voiceprint verification. By comparing against the pre-stored data of multiple people during verification, the equal error rate of voiceprint verification is reduced, lowering the verification error caused by model error.
Referring to Fig. 2, a voiceprint verification device of an embodiment of the application comprises:
a discriminating module 1, configured to input the voice signal to be verified into a VAD model and distinguish the speech frames and noise frames in the voice signal.
The VAD model of the present embodiment, also known as a voice endpoint detector, detects whether voice data is present in a noisy environment. The VAD model scores each frame of the input voice signal, i.e. the probability that the frame is a speech frame or a noise frame; when the speech-frame probability exceeds a preset decision threshold, the frame is determined to be a speech frame, otherwise a noise frame. According to this decision, the VAD model distinguishes speech frames from noise frames so that the noise frames can be removed from the voice signal. The decision threshold of the present embodiment uses the default decision threshold in the WebRTC source code, which was obtained by analyzing massive amounts of data during the development of WebRTC; this improves the effect and accuracy of the discrimination while reducing the training workload of the VAD model.
a removing module 2, configured to remove the noise frames and obtain purified voice data composed of the speech frames.
According to the discrimination result, the present embodiment cuts off the data labeled as noise frames and arranges the remaining speech frames consecutively in their original time order to form the purified voice data. In other embodiments of the application, the data labeled as speech frames may instead be extracted and saved, and the saved speech frames arranged consecutively in their original time order to form the purified voice data. By discarding background noise that does not come from the speaker in the environment of the identity registration or authentication step, the present embodiment reduces the influence of noise data on the voiceprint verification result, improving the verification success rate.
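The cut-noise-frames behavior of the removing module can be illustrated with a minimal sketch; the frame contents and labels here are placeholders, and real VAD output would label frames by the probability comparison described above.

```python
def purify(frames, labels):
    """Keep only the frames labelled as speech, preserving their original
    time order (the noise-frame removal described above)."""
    return [f for f, lab in zip(frames, labels) if lab == "speech"]

frames = ["f0", "f1", "f2", "f3", "f4"]
labels = ["speech", "noise", "speech", "noise", "speech"]
clean = purify(frames, labels)   # ["f0", "f2", "f4"]
```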
an extraction module 3, configured to extract the first voiceprint feature corresponding to the purified voice data.
By analyzing only the purified voice data, the present embodiment reduces the computation in voiceprint verification while improving its validity, pertinence, and timeliness.
a judgment module 4, configured to judge whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition.
The preset condition of the present embodiment includes a specified threshold range, a specified ranking, or the like, and can be customized according to the specific application scenario to broadly satisfy personalized usage requirements.
a determination module 5, configured to determine that the first voiceprint feature is identical to the pre-stored voiceprint feature if the preset condition is met, and otherwise that they are different.
If the first voiceprint feature is determined to be identical to the pre-stored voiceprint feature, a verification-passed result is fed back to the client; otherwise a verification-failed result is fed back, so that the client performs further application operations according to the result. For example, after verification passes, a smart door is controlled to open; as another example, after a predetermined number of failed verifications, a security system locks the screen to prevent a criminal from further compromising an electronic banking system.
Further, the VAD model of the present embodiment includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and the discriminating module 1 comprises:
a conversion unit, configured to input the voice signal into the Fourier transform in the VAD model and convert the voice signal from time-domain form into frequency-domain form.
Correspondingly, the present embodiment converts the time-domain signal into frequency-domain form through the Fourier transform in the VAD model, which facilitates analyzing the attributes of each frame of the voice signal and distinguishing speech frames from noise frames.
a discrimination unit, configured to input each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively for VAD decision, so as to distinguish the speech frames and noise frames in the voice signal.
The present embodiment preferably uses a VAD model based on Gaussian mixtures (GMM). For each frame of the input frequency-domain signal, energy is extracted in 6 frequency bands as the feature vector of that frame, and Gaussian mixture distributions are modeled separately for noise and speech in the 6 bands, giving in each band a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components. The 6 bands are configured according to the spectral difference between noise and speech in the WebRTC technology, to improve analysis accuracy and match WebRTC; in other embodiments of the application the number of analysis bands need not be 6 and can be set as required. Since the Chinese AC mains standard is 220 V at 50 Hz, 50 Hz mains interference can leak into the microphone collecting the voice signal, and the collected interference signal and physical vibration would affect the result; the present embodiment therefore preferably collects the voice signal above 80 Hz to reduce mains interference. As the highest frequency reached by speech is 4 kHz, the present embodiment preferably places the band boundaries at spectral troughs within the range of 80 Hz to 4 kHz. The VAD decision of the present embodiment includes a local decision (Local Decision) and a global decision (Global Decision).
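Under the assumption of six bands spanning 80 Hz to 4 kHz, the per-band energy feature extraction described above could be sketched as follows; the band edges chosen here are illustrative only, since the embodiment places the boundaries at spectral troughs found from data.

```python
import numpy as np

# Assumed band edges (Hz); the real boundaries lie at spectral troughs.
BAND_EDGES = [80, 250, 500, 1000, 2000, 3000, 4000]

def band_energies(spectrum, freqs):
    """Sum spectral power in each of the six bands, producing the
    6-dimensional feature vector fed to GMM-NOISE and GMM-SPEECH."""
    feats = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(float(np.sum(spectrum[mask])))
    return feats

freqs = np.linspace(0, 4000, 512)
spectrum = np.ones(512)          # flat toy spectrum
feats = band_energies(spectrum, freqs)
```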
Further, the discrimination unit of the present embodiment comprises:
an input subunit, configured to input each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively, obtaining for each frame the noise-frame probability p_noise and the speech-frame probability p_speech.
Each frame of the voice signal to be pre-analyzed as speech or noise is input into GMM-NOISE and GMM-SPEECH respectively, and the noise-frame probability value and speech-frame probability value given by the two models are obtained for each frame, so that by comparing the two probability values the frame can be determined to be a noise frame or a speech frame.
a first computation subunit, configured to compute the local log-likelihood ratio in each band k according to Λ_k = log(p_speech,k / p_noise,k).
Since the VAD model of the present embodiment extracts energy in 6 frequency bands as the feature vector of each frame, n is 6 in the present embodiment: when judging each frame, 6 local decisions are made, one in each band, and as long as any one of them considers the frame to be a speech frame, the frame is retained.
a first judgment subunit, configured to judge whether the local log-likelihood ratio is higher than a local threshold value.
The present embodiment distinguishes speech frames from noise frames through local decisions, one per band, 6 in total. The likelihood ratio is an index of authenticity, a composite index reflecting both sensitivity and specificity, which improves probability-estimation accuracy. On top of ensuring that the speech-frame probability exceeds the noise-frame probability, the present embodiment further checks whether the local log-likelihood ratio is higher than the local threshold, to ensure the accuracy of determining the frame to be a speech frame.
a first determination subunit, configured to determine, if a local log-likelihood ratio is higher than the local threshold, that the corresponding frame is a speech frame.
The parameters of the GMMs of the present embodiment are updated adaptively: after each frame of the voice signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to the feature values of that frame. For example, if the frame is judged to be a speech frame, the expected values, standard deviations, and Gaussian component weights of GMM-SPEECH are updated once according to the frame's feature values; after more and more speech frames are input into GMM-SPEECH, it adapts increasingly to the voiceprint characteristics of the speaker in the current voice signal, and the analysis conclusions it gives become more accurate.
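The adaptive update of a Gaussian component might be sketched as a simple exponential moving average; the learning rate and the exact update rule are assumptions made for illustration and are not the WebRTC update equations.

```python
def update_gaussian(mean, var, weight, x, lr=0.05):
    """One moving-average update of a Gaussian component's mean and variance
    after a frame with feature value x is assigned to it (lr is assumed)."""
    new_mean = (1 - lr) * mean + lr * x
    new_var = (1 - lr) * var + lr * (x - new_mean) ** 2
    return new_mean, new_var, weight

mean, var, w = 0.0, 1.0, 0.5
mean, var, w = update_gaussian(mean, var, w, x=2.0)
```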
Further, the discrimination unit of another embodiment of the application comprises:
a second computation subunit, configured to compute, if no local log-likelihood ratio is higher than the local threshold, the global log-likelihood ratio as the weighted sum of the per-band local log-likelihood ratios, Λ_global = Σ_{k=1}^{n} w_k Λ_k.
The present embodiment first performs the local decisions and then the global decision; the global decision computes a weighted sum over the bands on the basis of the local decision results, improving the accuracy of distinguishing speech frames from noise frames.
a second judgment subunit, configured to judge whether the global log-likelihood ratio is higher than a global threshold.
Comparing the global log-likelihood ratio with the global threshold in the global decision further improves the accuracy of screening speech frames.
a second determination subunit, configured to determine, if the global log-likelihood ratio is higher than the global threshold, that the corresponding frame is a speech frame.
The present embodiment may conclude that voice is present from the local decision result alone, without a global decision, to improve the efficiency of voiceprint verification and recognize as many speech frames as possible, avoiding voice distortion. In other embodiments of the application, when the local decision result indicates the presence of voice, the global decision may still be performed to further verify and confirm the presence of voice, improving the accuracy of distinguishing speech frames from noise frames.
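Combining the local and global decisions, a minimal sketch follows; the thresholds, equal band weights, and probability values are illustrative assumptions, not the WebRTC defaults.

```python
import math

def local_llrs(p_speech, p_noise):
    """Per-band log-likelihood ratios log(p_speech_k / p_noise_k)."""
    return [math.log(s / n) for s, n in zip(p_speech, p_noise)]

def is_speech(p_speech, p_noise, weights, local_thr=0.9, global_thr=0.5):
    """A frame is speech if any band's LLR exceeds the local threshold;
    otherwise, if the weighted sum of LLRs exceeds the global threshold."""
    llrs = local_llrs(p_speech, p_noise)
    if any(l > local_thr for l in llrs):            # local decision, 6 bands
        return True
    g = sum(w * l for w, l in zip(weights, llrs))   # global decision
    return g > global_thr

p_s = [0.6, 0.7, 0.55, 0.65, 0.6, 0.7]   # hypothetical speech probabilities
p_n = [0.4, 0.3, 0.45, 0.35, 0.4, 0.3]   # hypothetical noise probabilities
w = [1 / 6] * 6
speech = is_speech(p_s, p_n, w)
```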
Further, the extraction module 3 of the present embodiment comprises:
an extraction unit, configured to extract the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data.
The process of extracting the MFCC (Mel Frequency Cepstrum Coefficient) type voiceprint feature in the present embodiment is as follows. First, sampling and quantization: the continuous analog voice signal of the purified voice data is sampled at a certain sampling period into a discrete signal, which is quantized into a digital signal according to a certain coding rule. Then pre-emphasis: owing to the physiological characteristics of the human body, the high-frequency components of the voice signal are often suppressed, and pre-emphasis compensates for them. Then framing: because of the short-time stationarity of voice signals, spectrum analysis of a segment of voice is performed frame by frame (generally one frame every 10 to 30 milliseconds), and feature extraction then proceeds in units of frames. Then windowing, using a Hamming window, to reduce the discontinuity introduced at the start and end of each frame. A DFT is then applied to each frame to transform the signal from the time domain to the frequency domain, and the signal is mapped from the linear spectral domain to the mel spectral domain by the formula Mel(f) = 2595 · log₁₀(1 + f / 700). The converted frame signal is input into a group of mel triangular filter banks, and the logarithmic energy of the output of the filter in each band is computed, yielding a logarithmic-energy sequence. Applying the discrete cosine transform (DCT, Discrete Cosine Transform) to this logarithmic-energy sequence yields the MFCC-type voiceprint feature of the frame.
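Two pieces of the MFCC pipeline above, the mel-frequency mapping and the final DCT, can be shown directly; the filterbank output used here is a toy sequence, not real filter energies.

```python
import numpy as np

def hz_to_mel(f):
    """Linear-to-mel mapping Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def dct_ii(x):
    """Plain DCT-II of a log-energy sequence (the last MFCC step)."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n))
                     for m in range(n)])

log_energies = np.log(np.array([1.0, 2.0, 4.0, 8.0]))  # toy filterbank output
mfcc = dct_ii(log_energies)
mel_4k = hz_to_mel(4000.0)   # upper edge of the analyzed speech band
```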
a construction unit, configured to construct the voiceprint feature vector corresponding to each speech frame from its MFCC-type voiceprint features.
MFCC-type voiceprint features have a nonlinear characteristic, so the analysis result in each band is closer to the characteristics of real human speech; the extracted voiceprint features are more accurate, improving the effect of voiceprint verification.
a mapping unit, configured to map each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
The present embodiment maps each voiceprint feature vector to a low-dimensional i-vector based on GMM-UBM (Gaussian Mixture Model-Universal Background Model), reducing the computation cost of voiceprint feature extraction and thus the cost of voiceprint verification. The training process of the GMM-UBM of the present embodiment is as follows. B1: obtain a preset quantity (for example, 100,000) of voice data samples, each corresponding to one voiceprint discriminant vector; the samples may be collected from the voices of different people in different environments, so that they can be used to train a universal background model (GMM-UBM) characterizing general speech characteristics. B2: process each voice data sample to extract its corresponding preset-type voiceprint features, and construct each sample's voiceprint feature vector from them. B3: divide all constructed preset-type voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, the two percentages together not exceeding 100%. B4: train the second model with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model with the verification set. B5: if the accuracy is greater than a preset accuracy (for example, 98.5%), model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4, and B5 on the enlarged sample set.
The voiceprint discriminant vectors of the present embodiment are expressed as i-vectors. Compared with the dimensionality of the Gaussian supervector space, the dimensionality of the i-vector is much lower, which reduces the computation cost; the low-dimensional i-vector is related to the Gaussian space by multiplying it with a transition matrix T, which maps it into the higher-dimensional space. Extraction of the i-vector includes the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted from a processed segment of training speech of a target speaker are input into the GMM-VAD model to obtain a Gaussian supervector characterizing the probability distribution of this speech segment over the Gaussian components; the lower-dimensional voiceprint discriminant vector (i-vector) corresponding to this segment can then be computed from the formula m_r = μ + T·ω_r, where m_r is the Gaussian supervector representing this speech segment, μ is the mean supervector of the second model, ω_r is the low-dimensional i-vector, and T is the transition matrix that maps the low-dimensional i-vector into the high-dimensional Gaussian space; T is trained with the EM algorithm.
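The mapping m_r = μ + T·ω_r can be demonstrated numerically; the dimensions and the random μ and T are toy assumptions, and the least-squares recovery of ω is only an illustration (real i-vector extraction is a MAP estimate with T trained by EM, not plain least squares).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 20-dim Gaussian supervector space, a 5-dim i-vector.
mu = rng.normal(size=20)       # mean supervector of the UBM (assumed)
T = rng.normal(size=(20, 5))   # total-variability matrix (trained by EM in practice)
omega = rng.normal(size=5)     # the low-dimensional i-vector

# m = mu + T @ omega lifts the i-vector back into supervector space.
m = mu + T @ omega

# Conversely, omega can be recovered from m by least squares here, because
# the toy m was generated exactly from the model.
omega_hat, *_ = np.linalg.lstsq(T, m - mu, rcond=None)
```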
Further, the judgment module 4 of the present embodiment comprises:
an acquiring unit, configured to obtain the corresponding pre-stored voiceprint feature of each person from the pre-stored voiceprint feature data of multiple people, where the voiceprint feature data of the multiple people include the pre-stored voiceprint feature of the target person.
The present embodiment judges whether the voiceprint feature of the currently collected voice signal is identical to that of the target person against the pre-stored voiceprint feature data of multiple people, including the target person, so as to improve judgment accuracy.
a computing unit, configured to calculate the similarity value between each pre-stored voiceprint feature and the first voiceprint feature.
The similarity value characterizes how similar a pre-stored voiceprint feature is to the first voiceprint feature; the larger the value, the more similar the two. In the present embodiment the similarity value is obtained by comparing a feature distance between the pre-stored voiceprint feature and the first voiceprint feature, where the feature distance includes a cosine distance, a Euclidean distance, and the like.
a sequencing unit, configured to sort the similarity values in descending order.
By sorting the similarity values between each pre-stored voiceprint feature and the first voiceprint feature from largest to smallest, the present embodiment can analyze more accurately how the similarity of the first voiceprint feature is distributed over the pre-stored voiceprint features, so as to verify the first voiceprint feature more accurately.
a judging unit, configured to judge whether the similarity value corresponding to the pre-stored voiceprint feature of the target person is among the top preset number of sorted similarity values.
If the similarity value corresponding to the target person's pre-stored voiceprint feature ranks within the top preset number, the first voiceprint feature is determined to be identical to the pre-stored voiceprint feature of the target person. This reduces recognition errors caused by model error, such as the equal error rate, i.e. the rate at which the frequency of rejecting a verification that should pass equals the frequency of passing a verification that should fail. The preset number of the present embodiment may be 1, 2, 3, or the like, and can be set freely according to usage requirements.
a determining unit, configured to determine that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition if the similarity value corresponding to the target person's pre-stored voiceprint feature is included, and otherwise that the preset condition is not met.
In other embodiments of the application, effective voiceprint verification is realized by setting a threshold on the distance between the first voiceprint feature and the target user's pre-stored voiceprint feature. For example, with a preset threshold of 0.6, if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the preset threshold, the two are determined to be identical and verification passes; if the cosine distance is greater than the preset threshold, the two are determined to be different and verification fails.
Further, the computing unit of the present embodiment comprises:
a third computation subunit, configured to compute the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature with the cosine formula cos(x, y) = (x · y) / (‖x‖ ‖y‖), where x denotes a pre-stored voiceprint discriminant vector and y denotes the voiceprint discriminant vector of the first voiceprint feature.
The cosine formula expresses the similarity between each pre-stored voiceprint feature and the first voiceprint feature: the smaller the cosine distance, the closer, or more nearly identical, the two voiceprint features are.
a conversion subunit, configured to convert the cosine distance values into the similarity values, where the smallest cosine distance value corresponds to the largest similarity value.
The present embodiment may convert a cosine distance value into a similarity value through an inverse-proportion formula with a specified inverse-proportion coefficient.
Referring to Fig. 3, an embodiment of the application also provides a computer equipment, which may be a server and whose internal structure may be as shown in Fig. 3. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer equipment provides calculation and control capability. The memory of the computer equipment includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer equipment stores all the data needed by the voiceprint verification process. The network interface of the computer equipment communicates with an external terminal through a network connection. When executed by the processor, the computer program implements the method of voiceprint verification.
The processor executes the method of voiceprint verification described above, comprising: inputting the voice signal to be verified into the VAD model and distinguishing the speech frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting the first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition; and if so, determining that the first voiceprint feature is identical to the pre-stored voiceprint feature, and otherwise that they are different.
The computer equipment described above identifies the noise data in the voice signal and removes it to obtain purified voice data, then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-VAD model, combining local decisions with a global decision, noise data and voice data are accurately distinguished, raising the degree of purification of the voice signal and further improving the accuracy of voiceprint verification. Based on GMM-UBM, each voiceprint feature vector is mapped to a low-dimensional voiceprint discriminant vector (i-vector), reducing the computation cost of voiceprint feature extraction and thus the cost of voiceprint verification. By comparing against the pre-stored data of multiple people during verification, the equal error rate of voiceprint verification is reduced, lowering the verification error caused by model error.
In one embodiment, the VAD model includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and the step in which the processor inputs the voice signal into the VAD model to distinguish the speech frames and noise frames in the voice signal comprises: inputting the voice signal into the Fourier transform in the VAD model to convert the voice signal from time-domain form into frequency-domain form; and inputting each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively for VAD decision, so as to distinguish the speech frames and noise frames in the voice signal.
In one embodiment, the step in which the processor inputs each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively for VAD decision, so as to distinguish the speech frames and noise frames in the voice signal, comprises: inputting each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively, obtaining for each frame the noise-frame probability p_noise and the speech-frame probability p_speech; computing the local log-likelihood ratio in each band k according to Λ_k = log(p_speech,k / p_noise,k); judging whether the local log-likelihood ratio is higher than a local threshold value; and if so, determining that the frame whose local log-likelihood ratio is higher than the local threshold is a speech frame.
In one embodiment, after the step of judging whether the local log-likelihood ratio is higher than the local threshold, the processor further: computes, if no local log-likelihood ratio is higher than the local threshold, the global log-likelihood ratio according to Λ_global = Σ_{k=1}^{n} w_k Λ_k; judges whether the global log-likelihood ratio is higher than a global threshold; and if the global log-likelihood ratio is higher than the global threshold, determines that the corresponding frame is a speech frame.
In one embodiment, the step in which the processor extracts the first voiceprint feature corresponding to the purified voice data comprises: extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data; constructing the voiceprint feature vector corresponding to each speech frame from its MFCC-type voiceprint features; and mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
In one embodiment, the step in which the processor judges whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition comprises: obtaining the corresponding pre-stored voiceprint feature of each person from the pre-stored voiceprint feature data of multiple people, where the voiceprint feature data of the multiple people include the pre-stored voiceprint feature of the target person; calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature; sorting the similarity values in descending order; judging whether the similarity value corresponding to the target person's pre-stored voiceprint feature is among the top preset number of sorted similarity values; and if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that it does not.
In one embodiment, the step in which the above processor separately calculates the similarity value between each pre-stored voiceprint feature and the first voiceprint feature comprises: calculating, through the cosine distance formula cos(x, y) = (x · y) / (‖x‖ · ‖y‖), the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, wherein x represents each pre-stored voiceprint discrimination vector and y represents the voiceprint discrimination vector of the first voiceprint feature; and converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
Those skilled in the art will understand that the structure shown in Fig. 3 is merely a block diagram of the parts relevant to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the voiceprint verification method is implemented, comprising: inputting the voice signal for voiceprint verification into a VAD model, and distinguishing the speech frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting a first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition; and if satisfied, determining that the first voiceprint feature matches the pre-stored voiceprint feature, and otherwise determining that it does not.
The above computer-readable storage medium identifies the noise data in a voice signal and removes it to obtain purified voice data, then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. By combining local and global decisions in the GMM-based VAD model, noise data and voice data are accurately distinguished, which raises the purity of the clean speech signal and further improves the accuracy of voiceprint verification. Mapping each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector) based on GMM-UBM reduces the computational cost of voiceprint feature extraction and thereby the cost of voiceprint verification. Comparing against the pre-stored data of multiple persons during voiceprint verification lowers the equal error rate and reduces the error in verification precision caused by model error.
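For illustration, the purification step described above (distinguishing frames and removing noise frames) can be sketched as follows. This is a minimal energy-based stand-in, not the patent's GMM-based VAD; the threshold, frame size, and toy data are assumptions chosen only to demonstrate the remove-and-keep flow.

```python
import numpy as np

def simple_vad(frames, threshold=0.1):
    """Stand-in for the GMM-based VAD in the patent: flag frames whose
    RMS energy exceeds a threshold as speech, the rest as noise."""
    return [np.sqrt(np.mean(f ** 2)) > threshold for f in frames]

def purify(frames):
    """Remove the noise frames and keep only the speech frames,
    yielding the 'purified voice data' of the method."""
    keep = simple_vad(frames)
    return [f for f, k in zip(frames, keep) if k]

# Toy signal: 5 loud (speech-like) frames and 5 near-silent (noise) frames.
rng = np.random.default_rng(1)
speech_frames = [rng.normal(0, 1.0, 160) for _ in range(5)]
noise_frames = [rng.normal(0, 0.01, 160) for _ in range(5)]
frames = speech_frames + noise_frames

purified = purify(frames)
print(len(purified))  # 5: only the speech-like frames survive
```

In the patent's method the per-frame decision comes from the GMM-NOISE/GMM-SPEECH likelihood ratio rather than raw energy, but the removal step downstream is the same.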
In one embodiment, the VAD model comprises a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH. The step in which the above processor inputs the voice signal into the VAD model and distinguishes the speech frames and noise frames in the voice signal comprises: inputting the voice signal into the Fourier transform in the VAD model to convert the voice signal from its time-domain form into its frequency-domain form; and separately inputting each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH for VAD judgment, so as to distinguish the speech frames and noise frames in the voice signal.
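The time-domain to frequency-domain conversion in this step can be sketched with a windowed FFT. The frame length, hop size, and window choice below are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def to_frequency_frames(signal, frame_len=256, hop=128):
    """Window the time-domain signal and apply a Fourier transform to
    each frame, yielding the frequency-domain frames consumed by the
    GMM-based VAD judgment."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum
    return np.array(frames)

# 0.25 s of a 440 Hz tone sampled at 16 kHz (hypothetical input).
signal = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
spec = to_frequency_frames(signal)
print(spec.shape)  # (30, 129): 30 frames, frame_len // 2 + 1 bins each
```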
In one embodiment, the step in which the above processor separately inputs each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH for VAD judgment, so as to distinguish the speech frames and noise frames in the voice signal, comprises: separately inputting each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH to obtain the noise frame probability p_n and the speech frame probability p_s of each frame; calculating the local log-likelihood ratio according to Λ_local = log(p_s / p_n); judging whether the local log-likelihood ratio is higher than a local threshold value; and if so, determining that the frame data whose local log-likelihood ratio is higher than the local threshold value is a speech frame.
In one embodiment, after the step in which the above processor judges whether the local log-likelihood ratio is higher than the local threshold value, the method comprises: if the local log-likelihood ratio is not higher than the local threshold value, calculating the global log-likelihood ratio according to Λ_global = Σ_k w_k · log(p_s,k / p_n,k), where w_k is the weight of the k-th frequency band and p_s,k and p_n,k are the speech and noise probabilities of that band; judging whether the global log-likelihood ratio is higher than a global threshold value; and if the global log-likelihood ratio is higher than the global threshold value, determining that the corresponding frame data is a speech frame.
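The combined local and global decisions described above can be sketched as below. The per-band probabilities would in practice come from the trained GMM-NOISE and GMM-SPEECH models; here they are supplied directly, and the thresholds and band weights are illustrative assumptions.

```python
import numpy as np

def vad_decision(p_speech, p_noise, weights,
                 local_thresh=2.0, global_thresh=1.0):
    """Local decision: any band whose log-likelihood ratio
    log(p_s / p_n) exceeds the local threshold marks the frame as
    speech. Otherwise fall back to the global decision: a weighted
    sum of the per-band ratios compared against the global threshold."""
    llr = np.log(p_speech) - np.log(p_noise)       # per-band local LLR
    if np.any(llr > local_thresh):
        return True                                 # local decision fires
    return float(np.dot(weights, llr)) > global_thresh  # global decision

weights = np.array([0.25, 0.25, 0.25, 0.25])
# A frame where speech is moderately more likely in every band: no single
# band clears the local threshold, but the weighted sum clears the global one.
p_s = np.array([0.6, 0.7, 0.65, 0.6])
p_n = np.array([0.2, 0.2, 0.25, 0.2])
print(vad_decision(p_s, p_n, weights))  # True (via the global decision)
```

This two-stage structure is what lets the model catch frames where speech evidence is weak in any one band but consistent across the whole spectrum.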
In one embodiment, the step in which the above processor extracts the first voiceprint feature corresponding to the purified voice data comprises: extracting the MFCC voiceprint feature corresponding to each speech frame in the purified voice data; constructing, from each MFCC voiceprint feature, the voiceprint feature vector corresponding to each speech frame; and mapping each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), thereby obtaining the first voiceprint feature corresponding to each speech frame in the purified voice data.
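A simplified sketch of this feature chain follows. It omits the mel filterbank of a full MFCC pipeline and replaces the trained GMM-UBM i-vector extractor with a random projection; both simplifications, along with all dimensions and data, are assumptions for illustration only.

```python
import numpy as np

def frame_cepstral_feature(frame, n_coeff=13):
    """Very simplified cepstral feature for one speech frame: log
    magnitude spectrum followed by a type-II DCT, standing in for a
    full MFCC computation (no mel filterbank here)."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_spec = np.log(spectrum + 1e-10)
    n = len(log_spec)
    # Type-II DCT basis, written out explicitly to stay dependency-free.
    k = np.arange(n_coeff)[:, None]
    m = np.arange(n)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct @ log_spec

def to_ivector(features, projection):
    """Stand-in for the GMM-UBM i-vector extractor: project the mean
    cepstral feature down to a low-dimensional discrimination vector."""
    return projection @ features.mean(axis=0)

rng = np.random.default_rng(2)
frames = [rng.normal(size=160) for _ in range(20)]        # toy speech frames
feats = np.array([frame_cepstral_feature(f) for f in frames])  # (20, 13)
projection = rng.normal(size=(4, 13))  # hypothetical trained projection
ivec = to_ivector(feats, projection)
print(ivec.shape)  # (4,): a low-dimensional voiceprint vector
```

The point the sketch makes is dimensional: many per-frame feature vectors are collapsed into one small vector, which is what keeps the subsequent similarity comparisons cheap.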
In one embodiment, the step in which the above processor judges whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition comprises: obtaining the respective pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple persons, wherein the voiceprint feature data of the multiple persons includes the pre-stored voiceprint feature of a target person; separately calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature; sorting the similarity values in descending order; judging whether the similarity value corresponding to the target person's pre-stored voiceprint feature is among the top preset number of similarity values; and if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature satisfies the preset condition, and otherwise determining that it does not.
In one embodiment, the step in which the above processor separately calculates the similarity value between each pre-stored voiceprint feature and the first voiceprint feature comprises: calculating, through the cosine distance formula cos(x, y) = (x · y) / (‖x‖ · ‖y‖), the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, wherein x represents each pre-stored voiceprint discrimination vector and y represents the voiceprint discrimination vector of the first voiceprint feature; and converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
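The cosine distance comparison and its conversion to a descending similarity ranking can be sketched as follows. The conversion `similarity = 1 - distance` and the enrolled i-vector data are assumptions for illustration; the patent specifies only that the smallest distance maps to the largest similarity.

```python
import numpy as np

def cosine_distance(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||); distance = 1 - cosine.
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - cos

def rank_speakers(probe, enrolled):
    """Convert each cosine distance into a similarity value (smallest
    distance -> largest similarity) and return speaker IDs sorted by
    similarity in descending order, as in the top-N judgment step."""
    sims = {pid: 1.0 - cosine_distance(probe, vec)
            for pid, vec in enrolled.items()}
    return sorted(sims, key=sims.get, reverse=True)

# Toy enrolment database of i-vectors (hypothetical data).
rng = np.random.default_rng(3)
enrolled = {f"person{i}": rng.normal(size=8) for i in range(5)}
probe = enrolled["person3"] + 0.01 * rng.normal(size=8)  # near person3
top = rank_speakers(probe, enrolled)
print(top[0])  # person3: the target ranks first
```

The preset condition of the previous step then reduces to checking whether the target person's ID appears in `top[:n]` for the chosen preset quantity n.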
Those of ordinary skill in the art will understand that all or part of the processes in the above embodiment methods can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.
The foregoing are merely preferred embodiments of the present application and are not intended to limit the patent scope of the application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (10)

1. A voiceprint verification method, characterized by comprising:
inputting a voice signal for voiceprint verification into a VAD model, and distinguishing the speech frames and noise frames in the voice signal;
removing the noise frames to obtain purified voice data composed of the speech frames;
extracting a first voiceprint feature corresponding to the purified voice data;
judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition; and
if satisfied, determining that the first voiceprint feature matches the pre-stored voiceprint feature, and otherwise determining that it does not.
2. the method for voice print verification according to claim 1, which is characterized in that include that Fourier becomes in the VAD model It changes, the GMM-NOISE and GMM-SPEECH of Gaussian Mixture distribution, it is described that the voice signal is input in VAD model, it distinguishes The step of speech frame and noise frame in voice signal, comprising:
The voice signal is input in the Fourier transformation in VAD model, by the voice signal from time-domain signal form It is changed into frequency-region signal form;
Each frame data of the voice signal of frequency-region signal form are separately input in the GMM-NOISE and GMM-SPEECH VAD judgement is carried out, to distinguish the speech frame and noise frame in voice signal.
3. the method for voice print verification according to claim 2, which is characterized in that the voice by frequency-region signal form is believed Number each frame data be separately input in the GMM-NOISE and GMM-SPEECH carry out VAD judgement, to distinguish voice signal In speech frame and the step of noise frame, comprising:
Each frame data of the voice signal of frequency-region signal form are separately input in GMM-NOISE and GMM-SPEECH, respectively Obtain the noise frame probability of each frame dataWith speech frame probability
According toIt is right to calculate part Number likelihood ratio;
Judge whether the partial log likelihood ratio is higher than local gate limit value;
If so, the frame data for determining that the partial log likelihood ratio is higher than local gate limit value are speech frame.
4. the method for voice print verification according to claim 3, which is characterized in that described whether to judge the log-likelihood ratio After the step of higher than local gate limit value, comprising:
If partial log likelihood ratio is not higher than local gate limit value, basisCalculate global log-likelihood ratio;
Judge whether the global log-likelihood ratio is higher than global threshold;
If global log-likelihood ratio is higher than global threshold, determine that the global log-likelihood ratio is higher than the frame of global threshold Data are speech frame.
5. the method for voice print verification according to claim 1, which is characterized in that the voice data for extracting the purification The step of corresponding first vocal print feature, comprising:
Extract the corresponding MFCC type vocal print feature of each speech frame in the voice data of the purification;
The corresponding vocal print feature vector of each speech frame is constructed according to each MFCC type vocal print feature;
Each vocal print feature vector is each mapped to the vocal print discriminant vectors I-vector of low dimensional, to obtain the purification Voice data in corresponding first vocal print feature of each speech frame.
6. the method for voice print verification according to claim 5, which is characterized in that the judgement first vocal print feature with The step of whether similarity of vocal print feature meets preset condition prestored, comprising:
It is obtained in the vocal print feature data of the multiple people prestored respectively and corresponding prestores vocal print feature, wherein Duo Geren Vocal print feature data in include that target person prestores vocal print feature;
Calculate separately each similarity value prestored between vocal print feature and first vocal print feature;
Each similarity value is ranked up according to sequence from big to small;
Judgement sort preceding preset quantity similarity value in, if including the target person to prestore vocal print feature corresponding Similarity value;
If so, determining that first vocal print feature meets preset condition with the similarity for prestoring vocal print feature, otherwise it is unsatisfactory for Preset condition.
7. the method for voice print verification according to claim 6, which is characterized in that described calculate separately each described prestores vocal print The step of similarity value between feature and first vocal print feature, comprising:
Pass through COS distance formula respectivelyIt calculates and each described prestores vocal print feature and first sound COS distance value between line feature, wherein x representative respectively prestores vocal print discriminant vectors, and y represents the vocal print mirror of the first vocal print feature Other vector;
The COS distance value is converted into the similarity value, wherein the smallest COS distance value corresponds to maximum phase Like angle value.
8. A voiceprint verification device, characterized by comprising:
a discrimination module, configured to input a voice signal for voiceprint verification into a VAD model and distinguish the speech frames and noise frames in the voice signal;
a removal module, configured to remove the noise frames and obtain purified voice data composed of the speech frames;
an extraction module, configured to extract a first voiceprint feature corresponding to the purified voice data;
a judgment module, configured to judge whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature satisfies a preset condition; and
a determination module, configured to determine, if the preset condition is satisfied, that the first voiceprint feature matches the pre-stored voiceprint feature, and otherwise that it does not.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201811184693.9A 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium Active CN109378002B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811184693.9A CN109378002B (en) 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium
PCT/CN2018/124401 WO2020073518A1 (en) 2018-10-11 2018-12-27 Voiceprint verification method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811184693.9A CN109378002B (en) 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109378002A true CN109378002A (en) 2019-02-22
CN109378002B CN109378002B (en) 2024-05-07

Family

ID=65403684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811184693.9A Active CN109378002B (en) 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109378002B (en)
WO (1) WO2020073518A1 (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN105575406A (en) * 2016-01-07 2016-05-11 深圳市音加密科技有限公司 Noise robustness detection method based on likelihood ratio test
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition
CN108109612A (en) * 2017-12-07 2018-06-01 苏州大学 A kind of speech recognition sorting technique based on self-adaptive reduced-dimensions
CN108154371A (en) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic device, the method for authentication and storage medium
CN108172230A (en) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method, terminal installation and storage medium based on Application on Voiceprint Recognition model
CN108428456A (en) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 Voice de-noising algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6349278B1 (en) * 1999-08-04 2002-02-19 Ericsson Inc. Soft decision signal estimation
CN103236260B (en) * 2013-03-29 2015-08-12 京东方科技集团股份有限公司 Speech recognition system


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 It can interactive intelligence voice home control device and control method based on open source hardware
CN110675878A (en) * 2019-09-23 2020-01-10 金瓜子科技发展(北京)有限公司 Method and device for identifying vehicle and merchant, storage medium and electronic equipment
WO2021098153A1 (en) * 2019-11-18 2021-05-27 锐迪科微电子科技(上海)有限公司 Method, system, and electronic apparatus for detecting change of target user, and storage medium
WO2021127990A1 (en) * 2019-12-24 2021-07-01 广州国音智能科技有限公司 Voiceprint recognition method based on voice noise reduction and related apparatus
CN111274434A (en) * 2020-01-16 2020-06-12 上海携程国际旅行社有限公司 Audio corpus automatic labeling method, system, medium and electronic equipment
JP2022536190A (en) * 2020-04-28 2022-08-12 平安科技(深▲せん▼)有限公司 Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium
JP7184236B2 (en) 2020-04-28 2022-12-06 平安科技(深▲せん▼)有限公司 Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium
WO2022068675A1 (en) * 2020-09-29 2022-04-07 华为技术有限公司 Speaker speech extraction method and apparatus, storage medium, and electronic device
CN112331217A (en) * 2020-11-02 2021-02-05 泰康保险集团股份有限公司 Voiceprint recognition method and device, storage medium and electronic equipment
CN112331217B (en) * 2020-11-02 2023-09-12 泰康保险集团股份有限公司 Voiceprint recognition method and device, storage medium and electronic equipment
CN112735433A (en) * 2020-12-29 2021-04-30 平安普惠企业管理有限公司 Identity verification method, device, equipment and storage medium
CN113488059A (en) * 2021-08-13 2021-10-08 广州市迪声音响有限公司 Voiceprint recognition method and system

Also Published As

Publication number Publication date
WO2020073518A1 (en) 2020-04-16
CN109378002B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN109378002A (en) Method, apparatus, computer equipment and the storage medium of voice print verification
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
CN107610707A (en) A kind of method for recognizing sound-groove and device
CN109346086A (en) Method for recognizing sound-groove, device, computer equipment and computer readable storage medium
CN104978507B (en) A kind of Intelligent controller for logging evaluation expert system identity identifying method based on Application on Voiceprint Recognition
CN103971690A (en) Voiceprint recognition method and device
CN110443692A (en) Enterprise's credit authorization method, apparatus, equipment and computer readable storage medium
JP2008509432A (en) Method and system for verifying and enabling user access based on voice parameters
US20070198262A1 (en) Topological voiceprints for speaker identification
CN109473105A (en) The voice print verification method, apparatus unrelated with text and computer equipment
CN107346568A (en) The authentication method and device of a kind of gate control system
CN110164453A (en) A kind of method for recognizing sound-groove, terminal, server and the storage medium of multi-model fusion
CN108154371A (en) Electronic device, the method for authentication and storage medium
CN111081223B (en) Voice recognition method, device, equipment and storage medium
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Revathi et al. Person authentication using speech as a biometric against play back attacks
Chaudhari et al. Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction
Mubeen et al. Detection of impostor and tampered segments in audio by using an intelligent system
CN115102789A (en) Anti-communication network fraud studying, judging, early-warning and intercepting comprehensive platform
CN110188338A (en) The relevant method for identifying speaker of text and equipment
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
CN112992155A (en) Far-field voice speaker recognition method and device based on residual error neural network
Gupta et al. Text dependent voice based biometric authentication system using spectrum analysis and image acquisition
TWI778234B (en) Speaker verification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant