CN109378002A - Method, apparatus, computer device and storage medium for voiceprint verification - Google Patents
Method, apparatus, computer device and storage medium for voiceprint verification
- Publication number
- CN109378002A CN109378002A CN201811184693.9A CN201811184693A CN109378002A CN 109378002 A CN109378002 A CN 109378002A CN 201811184693 A CN201811184693 A CN 201811184693A CN 109378002 A CN109378002 A CN 109378002A
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- voice
- voiceprint feature
- frame
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
This application discloses a method, apparatus, computer device and storage medium for voiceprint verification. The method comprises: inputting the speech signal to be verified into a VAD model to distinguish the speech frames and noise frames in the speech signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting a first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition; and, if it does, determining that the first voiceprint feature matches the pre-stored voiceprint feature, otherwise determining that it does not. By identifying and removing the noise data in the speech signal to obtain purified voice data, and then performing voiceprint recognition on the purified voice data, the application improves the accuracy of voiceprint verification.
Description
Technical field
This application relates to the field of voiceprint verification, and in particular to a method, apparatus, computer device and storage medium for voiceprint verification.
Background
Currently, the business scope of many large financial companies covers insurance, banking, investment and other lines of business, each of which usually requires communicating with the same client and performing anti-fraud identification. Verifying the client's identity and identifying fraud have therefore become important components of guaranteeing business security. In the identity verification step, voiceprint verification is used by many companies because of its real-time nature and convenience. In practice, because of the environment in which the speaker performs identity registration or identity verification, the collected voice data often carries background noise that does not come from the speaker, and this has become one of the principal factors reducing the success rate of voiceprint verification.
Summary of the invention
The main purpose of this application is to provide a method of voiceprint verification, intended to solve the technical problem that the noise in the collected voice data adversely affects the result of voiceprint verification.
This application proposes a method of voiceprint verification, comprising:
inputting the speech signal to be verified into a VAD model, and distinguishing the speech frames and noise frames in the speech signal;
removing the noise frames to obtain purified voice data composed of the speech frames;
extracting a first voiceprint feature corresponding to the purified voice data;
judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
if it does, determining that the first voiceprint feature matches the pre-stored voiceprint feature; otherwise, determining that it does not.
Preferably, the VAD model includes a Fourier transform and two Gaussian mixture models, GMM-NOISE and GMM-SPEECH, and the step of inputting the speech signal into the VAD model and distinguishing the speech frames and noise frames in the speech signal comprises:
inputting the speech signal into the Fourier transform in the VAD model, to convert the speech signal from its time-domain form into its frequency-domain form;
inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH separately for a VAD decision, so as to distinguish the speech frames and noise frames in the speech signal.
Preferably, the step of inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH separately for a VAD decision, so as to distinguish the speech frames and noise frames in the speech signal, comprises:
inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH separately, obtaining for each frame a noise-frame probability p_noise and a speech-frame probability p_speech;
computing the local log-likelihood ratio in each frequency band as LLR = log(p_speech / p_noise);
judging whether the local log-likelihood ratio is higher than a local threshold value;
if it is, determining that the frame whose local log-likelihood ratio is higher than the local threshold value is a speech frame.
Preferably, after the step of judging whether the local log-likelihood ratio is higher than the local threshold value, the method comprises:
if the local log-likelihood ratio is not higher than the local threshold value, computing the global log-likelihood ratio as the weighted sum of the per-band local log-likelihood ratios;
judging whether the global log-likelihood ratio is higher than a global threshold value;
if the global log-likelihood ratio is higher than the global threshold value, determining that the frame whose global log-likelihood ratio is higher than the global threshold value is a speech frame.
Preferably, the step of extracting the first voiceprint feature corresponding to the purified voice data comprises:
extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data;
constructing the voiceprint feature vector corresponding to each speech frame from each MFCC-type voiceprint feature;
mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
Preferably, the step of judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition comprises:
obtaining the corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of the target person;
separately calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature;
sorting the similarity values in descending order;
judging whether the similarity values of the top preset number include the similarity value corresponding to the pre-stored voiceprint feature of the target person;
if they do, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition; otherwise, determining that it does not.
Preferably, the step of separately calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature comprises:
calculating the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature from the cosine formula cos(x, y) = (x · y) / (|x| |y|), where x represents each pre-stored voiceprint discriminant vector and y represents the voiceprint discriminant vector of the first voiceprint feature;
converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
This application also provides an apparatus for voiceprint verification, comprising:
a distinguishing module, configured to input the speech signal to be verified into a VAD model and distinguish the speech frames and noise frames in the speech signal;
a removing module, configured to remove the noise frames and obtain the purified voice data composed of the speech frames;
an extraction module, configured to extract the first voiceprint feature corresponding to the purified voice data;
a judgment module, configured to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets a preset condition;
a determination module, configured to determine, if the preset condition is met, that the first voiceprint feature matches the pre-stored voiceprint feature, and otherwise that it does not.
This application also provides a computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the steps of the above method when executing the computer program.
This application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the steps of the above method when executed by a processor.
By identifying the noise data in the speech signal, removing it to obtain purified voice data, and then performing voiceprint recognition on the purified voice data, this application improves the accuracy of voiceprint verification. Through the GMM-based VAD model, combining local decisions with a global decision, noise data and voice data are distinguished accurately, which improves the degree of purification of the speech signal and further improves the accuracy of voiceprint verification. Based on GMM-UBM, each voiceprint feature vector is mapped to a low-dimensional voiceprint discriminant vector (i-vector), which reduces the computational cost of voiceprint feature extraction and thus the cost of using voiceprint verification. By comparing against the pre-stored data of multiple people during voiceprint verification, the application lowers the equal error rate of voiceprint verification and reduces the loss of verification precision caused by model error.
Brief description of the drawings
Fig. 1 is a flow diagram of the voiceprint verification method of one embodiment of this application;
Fig. 2 is a structural diagram of the voiceprint verification apparatus of one embodiment of this application;
Fig. 3 is a diagram of the internal structure of the computer device of one embodiment of this application.
Detailed description of the embodiments
It should be appreciated that the specific embodiments described herein are only used to explain the application and are not intended to limit it.
Referring to Fig. 1, a method of voiceprint verification of one embodiment of this application comprises:
S1: inputting the speech signal to be verified into a VAD model, and distinguishing the speech frames and noise frames in the speech signal.
The VAD model of this embodiment, also known as a speech endpoint detector, detects whether speech data is present in a noisy environment. The VAD model scores each frame of the input speech signal, i.e., estimates the probability that the frame is a speech frame or a noise frame; when the speech-frame probability exceeds a preset decision threshold, the frame is judged to be a speech frame, and otherwise a noise frame. The VAD model distinguishes speech frames and noise frames according to this decision, so that the noise frames in the speech signal can be removed. The decision threshold of this embodiment uses the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of WebRTC; this improves the accuracy of the distinction while reducing the amount of training the VAD model requires.
S2: removing the noise frames to obtain the purified voice data composed of the speech frames.
According to the above distinction, this embodiment cuts out the data labeled as noise frames and arranges the remaining speech frames consecutively in their original time order to form the purified voice data composed of the speech frames. In other embodiments of this application, the data labeled as speech frames may instead be extracted and saved, and the saved speech frames arranged consecutively in their original time order to form the purified voice data. By discarding the background noise that originates from the environment, rather than from the speaker, during the identity registration or identity verification step, this embodiment reduces the influence of noise data on the voiceprint verification result and thus improves the verification success rate.
S3: extracting the first voiceprint feature corresponding to the purified voice data.
By analyzing only the first voiceprint feature corresponding to the purified voice data, this embodiment reduces the amount of computation in voiceprint verification while improving its validity, specificity and timeliness.
S4: judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets a preset condition.
The preset condition of this embodiment includes a specified threshold range, a specified ranking, or the like, and can be set according to the specific application scenario so as to broadly meet individual usage requirements.
S5: if it does, determining that the first voiceprint feature matches the pre-stored voiceprint feature; otherwise, determining that it does not.
When this embodiment determines that the first voiceprint feature matches the pre-stored voiceprint feature, it feeds the verification-passed result back to the client; otherwise it feeds back the verification-failed result, so that the client can perform further operations according to the result. For example, a smart door is opened after verification passes; or, after a preset number of failed verifications, the security system locks the screen to prevent a criminal from further operating the e-banking system.
Further, the VAD model of this embodiment includes a Fourier transform and the Gaussian mixture models GMM-NOISE and GMM-SPEECH, and step S1 comprises:
S100: inputting the speech signal into the Fourier transform in the VAD model, to convert the speech signal from its time-domain form into its frequency-domain form.
Accordingly, this embodiment converts the time-domain form into the frequency-domain form through the Fourier transform in the VAD model, which makes it convenient to analyze the attributes of each frame of the speech signal and to distinguish the speech frames from the noise frames.
S101: inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH separately for a VAD decision, so as to distinguish the speech frames and noise frames in the speech signal.
This embodiment preferably uses a VAD model based on Gaussian mixtures (GMM). For each input frame of the frequency-domain speech signal, the energy in six frequency bands is extracted as the feature vector of that frame, and noise and speech are each modeled by a Gaussian mixture in each of the six bands, giving in each band a noise model GMM-NOISE with two Gaussian components and a speech model GMM-SPEECH with two Gaussian components. The six bands follow the WebRTC scheme, which divides the spectrum according to the spectral differences between noise and speech, improving analysis accuracy and matching the WebRTC implementation. In other embodiments of this application the number of analysis bands need not be six and can be set as required. Because the Chinese mains standard is 220 V at 50 Hz, 50 Hz mains interference can leak into the microphone collecting the speech signal, and the collected interference and physical vibration would affect the result; this embodiment therefore preferably collects the speech signal above 80 Hz to reduce mains interference, and since the highest frequency that speech reaches is about 4 kHz, this embodiment preferably divides the bands at spectral troughs within the range of 80 Hz to 4 kHz. The VAD decision of this embodiment includes a local decision (Local Decision) and a global decision (Global Decision).
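The six-band energy feature described above can be sketched as follows. The exact band edges are an assumption for illustration (the patent says they follow WebRTC's split at spectral troughs between 80 Hz and 4 kHz, without listing them), so the `BAND_EDGES_HZ` values below are placeholders.

```python
import numpy as np

# Illustrative band edges (Hz) between 80 Hz and 4 kHz; the actual WebRTC
# sub-band boundaries are an assumption here, not taken from the patent.
BAND_EDGES_HZ = [80, 250, 500, 1000, 2000, 3000, 4000]

def band_energies(frame: np.ndarray, sample_rate: int) -> np.ndarray:
    """Transform one time-domain frame to the frequency domain with an FFT
    and sum the spectral energy inside each of the six sub-bands, giving
    the 6-dimensional feature vector scored by the per-band GMMs."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energies = []
    for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        energies.append(spectrum[mask].sum())
    return np.array(energies)
```

A 440 Hz tone, for instance, concentrates its energy in the second band (250-500 Hz).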
Further, step S101 of this embodiment comprises:
S1010: inputting each frame of the frequency-domain speech signal into GMM-NOISE and GMM-SPEECH separately, obtaining for each frame a noise-frame probability p_noise and a speech-frame probability p_speech.
In this embodiment, each frame of the speech signal to be classified as speech or noise is input into GMM-NOISE and GMM-SPEECH separately, yielding the noise-frame probability value and the speech-frame probability value that the two models compute for that frame, so that the frame can be determined to be a noise frame or a speech frame by comparing the two probability values.
S1011: computing the local log-likelihood ratio in each band as LLR = log(p_speech / p_noise).
As this embodiment preferably uses a GMM-based VAD model that extracts the energy of each input frame in six bands as the frame's feature vector, the number of bands n is 6 in this embodiment: six local decisions are made when judging each frame, one in each band, and as long as any one of them considers the frame to be speech, the frame is retained.
S1012: judging whether the local log-likelihood ratio is higher than the local threshold value.
This embodiment distinguishes speech frames from noise frames through the local decision, which is made once in each band, six times in total. The likelihood ratio is an index of the reliability of a result; as a composite index reflecting sensitivity and specificity at the same time, it improves the accuracy of the probability estimate. While ensuring that the speech-frame probability value is greater than the noise-frame probability value, this embodiment further compares the local log-likelihood ratio against the local threshold value, to ensure the accuracy of judging the frame to be a speech frame.
S1013: if it is, determining that the frame whose local log-likelihood ratio is higher than the local threshold value is a speech frame.
The parameters of the GMMs of this embodiment are updated adaptively: after each frame of the speech signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to the feature values of that frame. For example, if a frame is judged to be a speech frame, the mean values, standard deviations and Gaussian component weights of GMM-SPEECH are updated once according to that frame's feature values; as more and more speech frames are fed into GMM-SPEECH, it adapts increasingly to the voiceprint characteristics of the speaker in the current call, and the conclusions it produces become more accurate.
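The local decision of steps S1011-S1013 can be sketched as follows. The threshold value is an illustrative placeholder (WebRTC's actual tuned thresholds differ), and the per-band probabilities are assumed to come from the two GMMs.

```python
import numpy as np

LOCAL_THRESHOLD = 0.5  # illustrative value, not WebRTC's tuned threshold

def local_decision(p_speech: np.ndarray, p_noise: np.ndarray,
                   threshold: float = LOCAL_THRESHOLD) -> bool:
    """Per-band local decision: compute the log-likelihood ratio
    log(p_speech / p_noise) in each of the six bands; if any single band's
    ratio exceeds the local threshold, the frame is kept as speech."""
    eps = 1e-12  # guard against log(0)
    llr = np.log((np.asarray(p_speech) + eps) / (np.asarray(p_noise) + eps))
    return bool(np.any(llr > threshold))
```

This matches the "as long as one band considers the frame to be speech, retain it" rule stated above.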
Further, after step S1012 of another embodiment of this application, the method comprises:
S1014: if the local log-likelihood ratio is not higher than the local threshold value, computing the global log-likelihood ratio as the weighted sum of the per-band local log-likelihood ratios.
This embodiment first makes the local decision and then the global decision; the global decision computes a weighted sum over the bands on the basis of the local decision results, to improve the accuracy of distinguishing speech frames from noise frames.
S1015: judging whether the global log-likelihood ratio is higher than the global threshold value.
The global decision of this embodiment compares the global log-likelihood ratio against the global threshold value, to further improve the accuracy of screening out speech frames.
S1016: if the global log-likelihood ratio is higher than the global threshold value, determining that the frame whose global log-likelihood ratio is higher than the global threshold value is a speech frame.
This embodiment may accept a frame on the local decision alone, when it finds speech present, without making the global decision, so as to improve the efficiency of voiceprint verification and recognize as many speech frames as possible, avoiding distortion of the speech. Other embodiments of this application may, even when the local decision finds speech present, still carry out the global decision to further verify and confirm the presence of speech, improving the accuracy of distinguishing speech frames from noise frames.
Further, step S3 of this embodiment comprises:
S30: extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data.
This embodiment extracts the MFCC (Mel Frequency Cepstrum Coefficient) type voiceprint feature as follows. First, sampling and quantization: the continuous analog speech signal of the purified voice data is sampled at a certain sampling period, converted into a discrete signal, and quantized into a digital signal according to a certain coding rule. Then pre-emphasis: owing to the physiological properties of the human body, the high-frequency component of the speech signal is often attenuated, and pre-emphasis compensates for it. Then framing: because of the short-time stationarity of speech, spectral analysis of a segment of speech is done frame by frame (generally 10 to 30 milliseconds per frame), and features are then extracted per frame. Then windowing: its purpose is to reduce the discontinuities at the start and end of each frame, and a Hamming window is used. Then a DFT is applied to each frame to transform the signal from the time domain to the frequency domain, and the signal is mapped from the linear spectral domain to the mel spectral domain using Mel(f) = 2595 · log10(1 + f / 700). The transformed frame signal is fed into a bank of triangular mel filters, and the logarithmic energy output by the filter in each band is computed, giving a log-energy sequence. Finally, applying a discrete cosine transform (DCT) to that log-energy sequence yields the MFCC-type voiceprint feature of the frame.
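The pipeline above can be sketched end to end with NumPy. This is a minimal illustration under assumed parameters (25 ms frames with a 10 ms hop at 16 kHz, 26 mel filters, 13 cepstral coefficients, pre-emphasis coefficient 0.97); a production extractor would differ in details such as liftering and energy handling.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700), as in the mapping above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Minimal MFCC sketch: pre-emphasis, framing, Hamming window, DFT,
    triangular mel filterbank log-energies, then a DCT."""
    # 1) pre-emphasis compensates the attenuated high-frequency component
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing (25 ms frames, 10 ms hop at 16 kHz) with a Hamming window
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 3) DFT -> power spectrum
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 4) triangular filters spaced evenly on the mel scale up to sr/2
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # 5) DCT-II of the log-energies; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1))
                       / (2 * n_filters))
    return log_energy @ dct_basis.T  # shape: (n_frames, n_ceps)
```

One second of 16 kHz audio yields 98 frames of 13 coefficients under these parameters.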
S31: constructing the voiceprint feature vector corresponding to each speech frame from each MFCC-type voiceprint feature.
The MFCC-type voiceprint feature is nonlinear, and its analysis results in each band are closer to the characteristics of real human speech, so the extracted voiceprint feature is more accurate and the voiceprint verification result improves.
S32: mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
This embodiment maps each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector) based on a GMM-UBM (Gaussian Mixture Model-Universal Background Model), which reduces the computational cost of voiceprint feature extraction and thus the cost of using voiceprint verification. The GMM-UBM of this embodiment is trained as follows. B1: obtain a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint discriminant vector; the samples can be collected from the voices of different people in different environments, so that they can be used to train a universal background model (GMM-UBM) characterizing general speech properties. B2: process each voice data sample to extract its voiceprint feature of the preset type, and construct the corresponding voiceprint feature vector of each sample from that feature. B3: divide all the constructed voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, the two percentages summing to no more than 100%. B4: train the second model with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model with the verification set. B5: if the accuracy exceeds a preset value (for example, 98.5%), the training ends; otherwise, increase the number of voice data samples and repeat steps B2, B3, B4 and B5 on the enlarged sample set.
The voiceprint discriminant vector of this embodiment is expressed as an i-vector. Relative to the dimensionality of the Gaussian supervector space, the i-vector has a low dimension, which helps reduce computational cost; the low-dimensional vector w is related to the high-dimensional Gaussian space through multiplication by a transition matrix T. Extracting the i-vector comprises the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted from a processed segment of training speech of a target speaker are input into the GMM-UBM, yielding a Gaussian supervector that characterizes the probability distribution of that segment over the Gaussian components; the low-dimensional voiceprint discriminant vector (i-vector) corresponding to the segment can then be computed from the formula m_r = μ + T·ω_r, where m_r is the Gaussian supervector representing the segment, μ is the mean supervector of the second model, and T is the transition matrix mapping the low-dimensional i-vector ω_r into the high-dimensional Gaussian space; T is trained with the EM algorithm.
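The mapping m_r = μ + T·ω_r can be illustrated with toy dimensions. This is a sketch under stated assumptions: the dimensions and random μ and T are invented for illustration, and the least-squares recovery of ω stands in for the MAP point estimate (with EM-trained T) used by real i-vector extractors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: e.g. a 4-component GMM over 3-dim features gives a 12-dim
# supervector space, while the i-vector lives in a much lower 2-dim space.
SUPERVECTOR_DIM, IVECTOR_DIM = 12, 2

mu = rng.normal(size=SUPERVECTOR_DIM)                    # UBM mean supervector
T = rng.normal(size=(SUPERVECTOR_DIM, IVECTOR_DIM))      # transition matrix

def supervector_from_ivector(w: np.ndarray) -> np.ndarray:
    """m_r = mu + T @ w: map the low-dimensional i-vector into the
    high-dimensional Gaussian supervector space."""
    return mu + T @ w

def ivector_from_supervector(m: np.ndarray) -> np.ndarray:
    """Recover w from a supervector by least squares (a simplification of
    the estimator used in practice)."""
    w, *_ = np.linalg.lstsq(T, m - mu, rcond=None)
    return w
```

The low dimensionality of ω_r is what cuts the cost of the downstream similarity comparisons.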
Further, step S4 of the present embodiment comprises:
S40: obtaining the corresponding pre-stored voiceprint feature of each of the multiple pre-stored persons from the voiceprint feature data of those persons, wherein the voiceprint feature data of the multiple persons includes the pre-stored voiceprint feature of the target person.
The present embodiment judges, against the pre-stored voiceprint feature data of multiple persons including the target person, whether the voiceprint feature of the currently captured voice signal is identical to that of the target person, thereby improving judgment accuracy.
S41: separately calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature.
The similarity value of the present embodiment characterizes the similarity between a pre-stored voiceprint feature and the first voiceprint feature; the larger the similarity value, the more similar the two are. In the present embodiment the similarity value is obtained by comparing the feature distance between the pre-stored voiceprint feature and the first voiceprint feature, where the feature distance includes a cosine distance value, a Euclidean distance value, or the like.
S42: sorting the similarity values in descending order.
By sorting the similarity values between each pre-stored voiceprint feature and the first voiceprint feature from largest to smallest, the present embodiment can more accurately analyze the similarity distribution between the first voiceprint feature and the pre-stored voiceprint features, so as to verify the first voiceprint feature more accurately.
S43: judging whether the similarity value corresponding to the pre-stored voiceprint feature of the target person is among the first preset number of sorted similarity values.
In the present embodiment, if the similarity value corresponding to the target person's pre-stored voiceprint feature appears among the first preset number of sorted similarity values, the first voiceprint feature is determined to be identical to the pre-stored voiceprint feature of the target person. This reduces the equal error rate caused by model error, where the equal error rate is the operating point at which "the frequency of failed verifications when verification should succeed equals the frequency of passed verifications when verification should fail". The preset number of the present embodiment may be 1, 2, 3 or the like, and can be set freely according to usage needs.
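As an illustrative sketch of the top-k check in step S43 (the person identifiers and scores are invented for the example), verification succeeds when the target's enrolled voiceprint ranks among the first preset number of sorted similarity values:

```python
def passes_topk(similarities, target_id, k=3):
    """similarities: dict person_id -> similarity to the captured voiceprint.
    Verification succeeds when the target's enrolled print ranks in the top k."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:k]

# invented scores: another enrolled speaker happens to score highest
scores = {"target": 0.91, "alice": 0.95, "bob": 0.40, "carol": 0.35}
ok = passes_topk(scores, "target", k=3)    # target ranks 2nd -> verified
fail = passes_topk(scores, "target", k=1)  # alice outranks target -> rejected
```

The choice of k trades false rejections against false acceptances, which is why the text allows it to be set according to usage needs.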
S44: if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition; otherwise, the preset condition is not met.
Other embodiments of the present application achieve effective voiceprint verification by setting a threshold on the distance between the first voiceprint feature and the pre-stored voiceprint feature of the target user. For example, with a preset threshold of 0.6: if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the preset threshold, the first voiceprint feature is determined to be identical to the target user's pre-stored voiceprint feature and verification passes; if the cosine distance is greater than the preset threshold, the two are determined to be different and verification fails.
Further, step S41 of the present embodiment comprises:
S410: calculating the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula d(x, y) = 1 − (x · y)/(|x| |y|), where x represents each pre-stored voiceprint discrimination vector and y represents the voiceprint discrimination vector of the first voiceprint feature.
The present embodiment uses the cosine distance to express the similarity between each pre-stored voiceprint feature and the first voiceprint feature: the smaller the cosine distance value, the closer to identical the two voiceprint features are.
S411: converting the cosine distance values into similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
The present embodiment can convert a cosine distance value into a similarity value through an inverse-proportion formula with a specified inverse-proportion coefficient.
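The distance-to-similarity conversion of steps S410–S411 can be sketched as follows. The patent does not fix the inverse-proportion coefficient, so the mapping sim = k/(k + d) below is an assumed example; it only needs to be monotone decreasing so that the smallest distance yields the largest similarity.

```python
import math

def cosine_distance(x, y):
    """d(x, y) = 1 - (x . y) / (|x| |y|); 0 when the vectors coincide."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def to_similarity(d, k=1.0):
    # assumed inverse-proportion mapping: smallest distance -> largest similarity
    return k / (k + d)

d_same = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel -> ~0
d_diff = cosine_distance([1.0, 0.0], [0.0, 1.0])            # orthogonal -> 1
```

Parallel vectors give distance near 0 and hence the maximum similarity, matching the rule that the smallest cosine distance corresponds to the largest similarity value.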
The present embodiment identifies the noise data in the voice signal and removes it to obtain purified voice data, then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-based VAD model, combining local decisions and a global decision, noise data and voice data are accurately distinguished, which improves the degree of purification of the voice signal and further improves the accuracy of voiceprint verification. Based on GMM-UBM, each voiceprint feature vector is mapped to a low-dimensional voiceprint discrimination vector I-vector, which reduces the computation cost of voiceprint feature extraction and thus the operating cost of voiceprint verification. By performing comparative analysis against the pre-stored data of multiple persons during voiceprint verification, the present embodiment reduces the equal error rate of voiceprint verification and reduces the loss of verification precision caused by model error.
Referring to Fig. 2, an apparatus for voiceprint verification of one embodiment of the present application comprises:
A discriminating module 1, configured to input the voice signal to be voiceprint-verified into a VAD model and distinguish the speech frames and noise frames in the voice signal.
The VAD model of the present embodiment, also known as a voice endpoint detector, detects whether voice data is present in a noisy environment. The VAD model scores each frame of the input voice signal, i.e., estimates the probability that the frame is a speech frame or a noise frame; when the speech-frame probability exceeds a preset decision threshold, the frame is determined to be a speech frame, and otherwise a noise frame. According to this decision result, the VAD model separates speech frames from noise frames so that the noise frames in the voice signal can be removed. The decision threshold of the present embodiment uses the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during WebRTC's development; this improves the effect and accuracy of the discrimination while reducing the training burden of the VAD model.
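The per-frame scoring described above can be sketched minimally as follows; the threshold value 0.6 is an invented stand-in for WebRTC's built-in default, which the patent does not quote:

```python
def classify_frames(speech_probs, threshold=0.6):
    """Label each frame 'speech' when its speech probability exceeds the
    decision threshold, otherwise 'noise'."""
    return ["speech" if p > threshold else "noise" for p in speech_probs]

# invented per-frame speech probabilities
labels = classify_frames([0.9, 0.2, 0.7, 0.5])
```

With the example probabilities, frames 1 and 3 exceed the threshold and are kept as speech; the others are marked as noise for removal.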
A removal module 2, configured to remove the noise frames and obtain the purified voice data composed of the speech frames.
According to the above discrimination result, the present embodiment cuts out the data labeled as noise frames and arranges the remaining speech frames consecutively in their original time order to form the purified voice data composed of the speech frames. Other embodiments of the present application may instead extract and save the data labeled as speech frames according to the discrimination result, and arrange the extracted speech frames consecutively in their original time order to form the purified voice data. By discarding the background noise that comes not from the speaker but from the environment of the enrollment or authentication session, the present embodiment reduces the influence of noise data on the voiceprint verification result and thereby improves the verification success rate.
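A minimal sketch of the purification step (frame contents and labels are invented placeholders): noise-labeled frames are dropped and the surviving speech frames keep their original time order.

```python
def purify(frames, labels):
    """Keep only speech-labelled frames, preserving original time order."""
    return [f for f, lab in zip(frames, labels) if lab == "speech"]

frames = ["f0", "f1", "f2", "f3", "f4"]
labels = ["speech", "noise", "speech", "noise", "speech"]
clean = purify(frames, labels)   # noise frames cut, order preserved
```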
An extraction module 3, configured to extract the first voiceprint feature corresponding to the purified voice data.
By analyzing only the first voiceprint feature of the purified voice data, the present embodiment reduces the computation involved in voiceprint verification while improving its validity, specificity and timeliness.
A judgment module 4, configured to judge whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets a preset condition.
The preset condition of the present embodiment includes a specified threshold range, a specified ranking, or the like, and can be customized according to the specific application scenario to meet personalized usage needs more broadly.
A determination module 5, configured to determine that the first voiceprint feature is identical to the pre-stored voiceprint feature if the preset condition is met, and otherwise not identical.
When the present embodiment determines that the first voiceprint feature is identical to the pre-stored voiceprint feature, it feeds back a verification-passed result to the client; otherwise it feeds back a verification-failed result, so that the client can perform further application operations according to the feedback. For example, a smart door is opened after verification passes; as another example, after a preset number of failed verifications the security system locks the screen, preventing an offender from further compromising the e-banking system.
Further, the VAD model of the present embodiment includes a Fourier transform and the Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and the above discriminating module 1 comprises:
A conversion unit, configured to input the voice signal into the Fourier transform in the VAD model and convert the voice signal from time-domain form into frequency-domain form.
The present embodiment converts the time-domain signal into frequency-domain form through the Fourier transform in the VAD model, which facilitates analyzing the attributes of each frame of the voice signal and distinguishing speech frames from noise frames.
A discrimination unit, configured to separately input each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH for VAD decision, so as to distinguish the speech frames and noise frames in the voice signal.
The present embodiment preferably uses a VAD model based on Gaussian mixtures (GMM). For each input frame of the frequency-domain signal, energy is extracted in 6 frequency bands as the feature vector of that frame, and Gaussian mixture distributions are modeled separately for noise and for speech in each of the 6 bands: each band has a noise model GMM-NOISE containing two Gaussian components and a speech model GMM-SPEECH containing two Gaussian components. The 6 bands follow the WebRTC convention and are chosen according to the spectral differences between noise and speech, improving analysis accuracy and compatibility with WebRTC. In other embodiments of the present application the number of analysis bands need not be 6 and can be set as needed. Since the Chinese AC mains standard is 220 V at 50 Hz, 50 Hz mains interference can leak into the microphone capturing the voice signal, and the captured interference and physical vibration would affect the result; the present embodiment therefore preferably captures signals above 80 Hz to reduce mains interference. Since the highest frequency reached by speech is about 4 kHz, the band boundaries are preferably placed at spectral troughs within the range 80 Hz to 4 kHz. The VAD decision of the present embodiment includes a local decision (Local Decision) and a global decision (Global Decision).
Further, the discrimination unit of the present embodiment comprises:
An input subunit, configured to separately input each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH, obtaining for each frame a noise-frame probability and a speech-frame probability.
The present embodiment inputs each frame of the voice signal to be pre-analyzed as speech or noise into GMM-NOISE and GMM-SPEECH separately, obtaining the noise-frame probability value and the speech-frame probability value that the two models produce for each frame, so that the frame can then be determined to be a noise frame or a speech frame by comparing the two probability values.
A first computation subunit, configured to calculate, for each frequency band k, the local log-likelihood ratio Λ_k = log(p_speech,k / p_noise,k), where p_speech,k and p_noise,k are the speech-frame and noise-frame probabilities of the frame in band k.
The present embodiment preferably uses a GMM-based VAD model that extracts energy in 6 frequency bands from each input frame of the frequency-domain signal as the feature vector of that frame, so n is 6 in the present embodiment: when judging each frame, 6 local decisions are carried out, one in each frequency band, and as long as any one of them considers the frame to be a speech frame, the frame is retained.
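A sketch of the per-band local decision, assuming the standard log-likelihood-ratio form Λ_k = log(p_speech,k / p_noise,k) described above; the band probabilities and the local threshold 0.5 are invented for the example, not WebRTC's trained values:

```python
import math

def local_llrs(p_speech, p_noise):
    """Per-band local log-likelihood ratios log(p_speech_k / p_noise_k)."""
    return [math.log(s / n) for s, n in zip(p_speech, p_noise)]

def local_decision(llrs, local_threshold=0.5):
    # one local decision per band; a single hit marks the frame as speech
    return any(l > local_threshold for l in llrs)

# 6 bands of invented per-band probabilities for one frame
p_speech = [0.30, 0.20, 0.70, 0.10, 0.25, 0.15]
p_noise = [0.25, 0.30, 0.20, 0.15, 0.30, 0.20]
llrs = local_llrs(p_speech, p_noise)
is_speech = local_decision(llrs)   # band 3: log(0.7/0.2) ~ 1.25 > 0.5
```

Only one band needs to exceed the local threshold for the frame to be retained, matching the "as long as any one local decision fires" rule.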
A first judgment subunit, configured to judge whether the local log-likelihood ratio is higher than the local threshold value.
The present embodiment distinguishes speech frames from noise frames through local decisions, one in each frequency band, 6 in total. The likelihood ratio is an index of authenticity, a composite measure reflecting both sensitivity and specificity, which improves the accuracy of the probability estimates. On top of ensuring that the speech-frame probability exceeds the noise-frame probability, the present embodiment further compares the local log-likelihood ratio with the local threshold to ensure the accuracy of determining the frame to be a speech frame.
A first determination subunit, configured to determine, if the local log-likelihood ratio is higher than the local threshold value, that the frame data whose local log-likelihood ratio is higher than the local threshold value is a speech frame.
The parameters of the GMM in the present embodiment are updated adaptively: after each frame of the voice signal is judged to be a speech frame or a noise frame, the parameters of the corresponding model are updated according to the feature values of that frame. For example, if the frame is judged to be a speech frame, the expected values, standard deviations and Gaussian component weights of GMM-SPEECH are updated once according to the frame's feature values; as more and more speech frames are input into GMM-SPEECH, it adapts increasingly to the voiceprint characteristics of the speaker of this call, and its analysis becomes increasingly accurate.
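The adaptive update can be sketched as a simple online step toward the new frame's feature value. The patent does not specify the exact update rule, so the exponential-moving-average form and the learning rate below are assumptions for illustration only:

```python
def update_component(mean, var, weight, x, lr=0.05):
    """One adaptive step after a frame is assigned to this Gaussian component:
    nudge mean/variance toward the frame's feature value x and raise the
    component weight slightly (simplified online update; assumed rule)."""
    new_mean = (1 - lr) * mean + lr * x
    new_var = (1 - lr) * var + lr * (x - new_mean) ** 2
    new_weight = (1 - lr) * weight + lr
    return new_mean, new_var, new_weight

# one update after a speech frame with (invented) feature value 2.0
m, v, w = update_component(mean=0.0, var=1.0, weight=0.5, x=2.0)
```

Repeated updates of this kind move GMM-SPEECH toward the statistics of the current speaker, which is the adaptation effect described above.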
Further, the discrimination unit of another embodiment of the present application comprises:
A second computation subunit, configured to calculate, when no local log-likelihood ratio exceeds the local threshold value, the global log-likelihood ratio as the weighted sum of the per-band local log-likelihood ratios, Λ_global = Σ_k w_k Λ_k.
The present embodiment first performs the local decisions and then the global decision; the global decision computes a weighted sum over the frequency bands on top of the local decision results, improving the accuracy of distinguishing speech frames from noise frames.
A second judgment subunit, configured to judge whether the global log-likelihood ratio is higher than the global threshold.
In the global decision of the present embodiment, the global log-likelihood ratio is compared with the global threshold to further improve the accuracy of selecting speech frames.
A second determination subunit, configured to determine, if the global log-likelihood ratio is higher than the global threshold, that the frame data whose global log-likelihood ratio is higher than the global threshold is a speech frame.
The present embodiment may skip the global decision when the local decision result already indicates the presence of voice, improving the efficiency of voiceprint verification while recognizing as many speech frames as possible to avoid voice distortion. Other embodiments of the present application may still perform the global decision when the local decision result indicates the presence of voice, so as to further verify and confirm that voice is present, improving the accuracy of distinguishing speech frames from noise frames.
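The combined local-then-global flow can be sketched as follows; the equal band weights and both thresholds are invented for the example (the patent leaves them unspecified):

```python
def global_llr(llrs, weights):
    """Weighted sum of the per-band local log-likelihood ratios."""
    return sum(w * l for w, l in zip(weights, llrs))

def vad_decision(llrs, weights, local_thr=0.5, global_thr=1.0):
    # local decisions first; fall back to the global decision only when
    # no single band already flags speech
    if any(l > local_thr for l in llrs):
        return True
    return global_llr(llrs, weights) > global_thr

weights = [1 / 6] * 6   # equal band weights, assumed for illustration
quiet = vad_decision([-0.2, -0.1, 0.0, -0.3, -0.2, -0.1], weights)
voiced = vad_decision([0.8, 0.1, 0.2, 0.1, 0.0, 0.1], weights)
```

The `voiced` frame is accepted by a local decision alone (band 1 exceeds the local threshold), so the global decision is skipped, exactly as the efficiency argument above describes; the `quiet` frame fails both stages.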
Further, the extraction module 3 of the present embodiment comprises:
An extraction unit, configured to extract the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data.
The process by which the present embodiment extracts the MFCC (Mel Frequency Cepstrum Coefficient) type voiceprint feature is as follows. First, sampling and quantization: the continuous analog voice signal of the purified voice data is sampled at a certain sampling period into a discrete signal, and the discrete signal is quantized into a digital signal according to a certain coding rule. Then pre-emphasis: due to human physiology the high-frequency components of the voice signal are often attenuated, and pre-emphasis compensates for them. Then framing: because of the short-time stationarity of voice signals, spectral analysis of a segment of voice is performed on frames (generally 10 to 30 milliseconds per frame), and feature extraction then proceeds frame by frame. Then windowing, whose purpose is to reduce the discontinuities induced at the start and end of each frame; a Hamming window is used. Then a DFT is applied to each frame to transform the signal from the time domain to the frequency domain, and the signal is mapped from the linear spectral domain to the Mel spectral domain using the formula Mel(f) = 2595 · log10(1 + f/700). The transformed frame signal is input to a bank of Mel triangular filters, and the logarithmic energy output by the filter of each band is computed, yielding a log-energy sequence. Finally, applying a discrete cosine transform (DCT, Discrete Cosine Transform) to this log-energy sequence yields the MFCC-type voiceprint feature of the frame.
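The pipeline above can be sketched compactly for a single frame. This is a deliberately simplified toy: the Mel triangular filter bank is collapsed to the raw spectrum bins to keep it short, and the frame length, pre-emphasis coefficient and cepstral count are assumed values.

```python
import math

def mel(f):
    # linear-to-Mel mapping used in the text: 2595 * log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def preemphasis(signal, a=0.97):
    return [signal[0]] + [signal[i] - a * signal[i - 1]
                          for i in range(1, len(signal))]

def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    # naive DFT power spectrum (slow but dependency-free)
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spec.append((re * re + im * im) / n)
    return spec

def mfcc_frame(frame, n_ceps=4):
    """Toy MFCC: pre-emphasis -> Hamming window -> DFT power spectrum ->
    log energies -> DCT-II. Filter-bank binning is omitted for brevity."""
    win = hamming(len(frame))
    x = [s * w for s, w in zip(preemphasis(frame), win)]
    loge = [math.log(p + 1e-10) for p in power_spectrum(x)]
    m = len(loge)
    return [sum(loge[j] * math.cos(math.pi * i * (j + 0.5) / m)
                for j in range(m)) for i in range(n_ceps)]

# one synthetic 64-sample frame containing a low-frequency tone
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
coeffs = mfcc_frame(frame)
```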
A construction unit, configured to construct the voiceprint feature vector corresponding to each speech frame from each MFCC-type voiceprint feature.
MFCC-type voiceprint features are nonlinear, and the per-band analysis results are closer to the characteristics of real human speech, so the extracted voiceprint features are more accurate, improving the effect of voiceprint verification.
A mapping unit, configured to map each voiceprint feature vector to a low-dimensional voiceprint discrimination vector I-vector, so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
The present embodiment is based on GMM-UBM (Gaussian Mixture Model-Universal Background Model) to map each voiceprint feature vector to a low-dimensional voiceprint discrimination vector I-vector, reducing the computation cost of voiceprint feature extraction and the operating cost of voiceprint verification. The training process of the GMM-UBM of the present embodiment is as follows. B1: obtain a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint discrimination vector; the samples may be voices collected from different people in different environments, so that they can be used to train a universal background model (GMM-UBM) characterizing general speech characteristics. B2: process each voice data sample to extract its preset-type voiceprint feature, and construct from it the voiceprint feature vector corresponding to each sample. B3: divide all constructed preset-type voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, where the first and second percentages sum to at most 100%. B4: train the second model with the voiceprint feature vectors in the training set, and after training is complete, verify the accuracy of the trained second model with the verification set. B5: if the accuracy exceeds a preset accuracy (for example, 98.5%), model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 on the enlarged sample set.
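Steps B3–B5 describe a split/train/verify/grow loop that can be sketched as below. The actual model training and data collection are abstract in the text, so `evaluate` and `grow` here are caller-supplied stand-ins, and the accuracy curve used in the example is fake:

```python
import random

def split_train_val(samples, train_frac=0.7, seed=0):
    """Step B3: shuffle and split feature vectors into a training set and a
    verification set (the two fractions sum to at most 1.0)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def train_until_accurate(samples, target_acc=0.985, evaluate=None, grow=None):
    """Steps B4-B5: train, verify, and enlarge the sample pool until the
    verification accuracy exceeds the preset accuracy (e.g. 98.5%)."""
    while True:
        train, val = split_train_val(samples)
        acc = evaluate(train, val)
        if acc > target_acc:
            return acc, len(samples)
        samples = grow(samples)

# toy stand-ins: accuracy rises with training-set size, capped at 1.0
fake_eval = lambda tr, va: min(1.0, 0.9 + 0.001 * len(tr))
fake_grow = lambda s: s + [0.0] * 20
acc, n = train_until_accurate([0.0] * 50, evaluate=fake_eval, grow=fake_grow)
```

With the fake evaluator, the loop enlarges the pool until the accuracy clears the 0.985 bar, mirroring the "increase samples and re-execute B2–B5" rule.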
The voiceprint discrimination vector of the present embodiment is expressed as an I-vector. An i-vector is a single vector whose dimension is much lower than that of the Gaussian supervector space, which reduces computation cost; the low-dimensional vector ω is related to the higher-dimensional Gaussian space by multiplication with a transition matrix T. Extraction of the I-vector includes the following steps: the preset-type voiceprint feature vectors (for example, MFCC) extracted after processing the training voice data of a target speaker are input to the GMM-UBM (the second model), yielding a Gauss supervector that characterizes the probability distribution of this segment of voice data over each Gaussian component; the lower-dimensional voiceprint discrimination vector I-vector of this segment of voice is then computed from the formula m_r = μ + Tω_r, where m_r is the Gauss supervector representing this segment of voice, μ is the mean supervector of the second model, and T is the transition matrix that maps the low-dimensional I-vector ω_r into the high-dimensional Gaussian space; T is trained using the EM algorithm.
Further, the judgment module 4 of the present embodiment comprises:
An acquiring unit, configured to obtain the corresponding pre-stored voiceprint feature of each of the multiple pre-stored persons from the voiceprint feature data of those persons, wherein the voiceprint feature data of the multiple persons includes the pre-stored voiceprint feature of the target person.
The present embodiment judges, against the pre-stored voiceprint feature data of multiple persons including the target person, whether the voiceprint feature of the currently captured voice signal is identical to that of the target person, thereby improving judgment accuracy.
A computing unit, configured to separately calculate the similarity value between each pre-stored voiceprint feature and the first voiceprint feature.
The similarity value of the present embodiment characterizes the similarity between a pre-stored voiceprint feature and the first voiceprint feature; the larger the similarity value, the more similar the two are. In the present embodiment the similarity value is obtained by comparing the feature distance between the pre-stored voiceprint feature and the first voiceprint feature, where the feature distance includes a cosine distance value, a Euclidean distance value, or the like.
A sorting unit, configured to sort the similarity values in descending order.
By sorting the similarity values between each pre-stored voiceprint feature and the first voiceprint feature from largest to smallest, the present embodiment can more accurately analyze the similarity distribution between the first voiceprint feature and the pre-stored voiceprint features, so as to verify the first voiceprint feature more accurately.
A judging unit, configured to judge whether the similarity value corresponding to the pre-stored voiceprint feature of the target person is among the first preset number of sorted similarity values.
In the present embodiment, if the similarity value corresponding to the target person's pre-stored voiceprint feature appears among the first preset number of sorted similarity values, the first voiceprint feature is determined to be identical to the pre-stored voiceprint feature of the target person. This reduces the equal error rate caused by model error, where the equal error rate is the operating point at which "the frequency of failed verifications when verification should succeed equals the frequency of passed verifications when verification should fail". The preset number of the present embodiment may be 1, 2, 3 or the like, and can be set freely according to usage needs.
A determining unit, configured to determine, if the similarity value corresponding to the target person's pre-stored voiceprint feature is included, that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that the preset condition is not met.
Other embodiments of the present application achieve effective voiceprint verification by setting a threshold on the distance between the first voiceprint feature and the pre-stored voiceprint feature of the target user. For example, with a preset threshold of 0.6: if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the preset threshold, the first voiceprint feature is determined to be identical to the target user's pre-stored voiceprint feature and verification passes; if the cosine distance is greater than the preset threshold, the two are determined to be different and verification fails.
Further, the computing unit of the present embodiment comprises:
A third computation subunit, configured to calculate the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula d(x, y) = 1 − (x · y)/(|x| |y|), where x represents each pre-stored voiceprint discrimination vector and y represents the voiceprint discrimination vector of the first voiceprint feature.
The present embodiment uses the cosine distance to express the similarity between each pre-stored voiceprint feature and the first voiceprint feature: the smaller the cosine distance value, the closer to identical the two voiceprint features are.
A conversion subunit, configured to convert the cosine distance values into similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
The present embodiment can convert a cosine distance value into a similarity value through an inverse-proportion formula with a specified inverse-proportion coefficient.
Referring to Fig. 3, an embodiment of the present application also provides a computer device, which may be a server whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides the environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores all the data that the voiceprint verification process needs. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements the method of voiceprint verification.
The method of voiceprint verification executed by the above processor comprises: inputting the voice signal to be voiceprint-verified into a VAD model and distinguishing the speech frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting the first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets a preset condition; and, if so, determining that the first voiceprint feature is identical to the pre-stored voiceprint feature, and otherwise not identical.
The above computer device identifies and removes the noise data in the voice signal to obtain purified voice data, then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-based VAD model, combining local decisions and a global decision, noise data and voice data are accurately distinguished, which improves the degree of purification of the voice signal and further improves the accuracy of voiceprint verification. Based on GMM-UBM, each voiceprint feature vector is mapped to a low-dimensional voiceprint discrimination vector I-vector, reducing the computation cost of voiceprint feature extraction and the operating cost of voiceprint verification. By comparative analysis against the pre-stored data of multiple persons during voiceprint verification, the equal error rate of voiceprint verification is reduced, as is the loss of verification precision caused by model error.
In one embodiment, in the VAD model include Fourier transformation, Gaussian Mixture be distributed GMM-NOISE and
The voice signal is input in VAD model by GMM-SPEECH, above-mentioned processor, distinguish voice signal in each speech frame and
The step of each noise frame, comprising: the voice signal is input in the Fourier transformation in VAD model, the voice is believed
Number it is changed into frequency-region signal form from time-domain signal form;Each frame data difference of the voice signal of frequency-region signal form is defeated
Enter into the GMM-NOISE and GMM-SPEECH and carry out VAD judgement, to distinguish the speech frame and noise frame in voice signal.
In one embodiment, the above-mentioned processor's step of inputting each frame of the frequency-domain voice signal into the GMM-NOISE and GMM-SPEECH models respectively for VAD judgement, so as to distinguish the speech frames and noise frames in the voice signal, comprises: inputting each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively, obtaining for each frame a noise-frame probability p_noise and a speech-frame probability p_speech; computing the local log-likelihood ratio according to LLR_local = log(p_speech / p_noise); judging whether the local log-likelihood ratio is higher than a local threshold; and if so, determining that the frame whose local log-likelihood ratio is higher than the local threshold is a speech frame.
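A minimal sketch of the per-frame local decision, assuming the two Gaussian mixture models expose per-frame log-probabilities (as, for example, scikit-learn's `GaussianMixture.score_samples` does); the default threshold of 0 is a placeholder for the local threshold:

```python
import numpy as np

def local_vad(log_p_speech, log_p_noise, local_threshold=0.0):
    """Per-frame local decision: label a frame as speech when its
    log-likelihood ratio log(p_speech / p_noise) exceeds the local
    threshold. Inputs are arrays of per-frame log-probabilities from
    GMM-SPEECH and GMM-NOISE respectively."""
    llr = np.asarray(log_p_speech) - np.asarray(log_p_noise)
    return llr > local_threshold, llr
```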
In one embodiment, after the above-mentioned processor's step of judging whether the local log-likelihood ratio is higher than the local threshold, the method comprises: if the local log-likelihood ratio is not higher than the local threshold, computing the global log-likelihood ratio over all L frames according to LLR_global = (1/L) Σ_k log(p_speech(k) / p_noise(k)); judging whether the global log-likelihood ratio is higher than a global threshold; and if the global log-likelihood ratio is higher than the global threshold, determining that the corresponding frames are speech frames.
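Combining the local and global decisions might look like the sketch below. Using the mean per-frame log-likelihood ratio as the global statistic, and treating a passing global check as rescuing the locally rejected frames, are assumptions consistent with the text rather than details fixed by the embodiment:

```python
import numpy as np

def vad_decide(llr, local_thr=0.0, global_thr=0.0):
    """Frames whose local LLR exceeds local_thr are speech. If some frames
    fail the local test, the utterance-level (global) LLR is checked; when
    it exceeds global_thr, the remaining frames are also kept as speech."""
    llr = np.asarray(llr, dtype=float)
    speech = llr > local_thr                 # local per-frame decision
    if not speech.all():
        global_llr = llr.mean()              # assumed global statistic
        if global_llr > global_thr:
            speech = np.ones_like(speech)    # rescued by the global decision
    return speech
```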
In one embodiment, the above-mentioned processor's step of extracting the first voiceprint feature corresponding to the purified voice data comprises: extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data; constructing the voiceprint feature vector corresponding to each speech frame from its MFCC-type voiceprint feature; and mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
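The mapping from frame-level features to a low-dimensional discriminant vector can be illustrated as below. A true i-vector extractor estimates a posterior under a total-variability model trained on the GMM-UBM; the fixed projection matrix T here is a stand-in assumption for that trained extractor:

```python
import numpy as np

def to_discriminant_vector(mfcc_frames, T):
    """Pool frame-level MFCC-type features into one vector, then project it
    to a low-dimensional discriminant vector. T plays the role of the
    trained total-variability projection (assumption, not the real
    i-vector posterior computation)."""
    supervector = np.asarray(mfcc_frames).mean(axis=0)   # pool over frames
    return T @ supervector                               # low-dimensional vector
```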
In one embodiment, the above-mentioned processor's step of judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition comprises: obtaining the pre-stored voiceprint features from the stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people include the pre-stored voiceprint feature of the target person; computing the similarity value between each pre-stored voiceprint feature and the first voiceprint feature; sorting the similarity values in descending order; judging whether the similarity value corresponding to the target person's pre-stored voiceprint feature is among the top preset number of similarity values; and if so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that it does not.
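The preset condition above, namely that the target person's stored voiceprint ranks among the most similar entries, reduces to a top-N membership test. The cutoff n stands in for the preset quantity, which the embodiment leaves unspecified:

```python
def meets_preset_condition(similarities, target_id, n=3):
    """similarities: dict mapping each enrolled person's id to the similarity
    between their pre-stored voiceprint feature and the first voiceprint
    feature. Passes when target_id ranks among the n most similar entries."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:n]
```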
In one embodiment, the above-mentioned processor's step of computing the similarity value between each pre-stored voiceprint feature and the first voiceprint feature comprises: computing the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula cos(x, y) = (x · y) / (‖x‖ ‖y‖), wherein x represents each pre-stored voiceprint discriminant vector and y represents the voiceprint discriminant vector of the first voiceprint feature; and converting the cosine distance values into the similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
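The cosine comparison and the distance-to-similarity conversion can be sketched as follows. The `1 - distance` mapping is one simple monotone choice; the embodiment only requires that the smallest cosine distance map to the largest similarity value:

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = x . y / (|x| |y|), where x is a pre-stored voiceprint
    discriminant vector and y is the probe's discriminant vector."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def distance_to_similarity(distance):
    """Monotone conversion so that the smallest cosine distance yields the
    largest similarity value (assumed mapping)."""
    return 1.0 - distance
```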
It will be understood by those skilled in the art that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the scheme of the present application, and does not constitute a limitation on the computer equipment to which the scheme is applied.
An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method of voiceprint verification is realized, comprising: inputting the voice signal for voiceprint verification into a VAD model and distinguishing the speech frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the speech frames; extracting the first voiceprint feature corresponding to the purified voice data; judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition; and if satisfied, determining that the first voiceprint feature is identical to the pre-stored voiceprint feature, and otherwise not identical.
The above computer-readable storage medium identifies the noise data in the voice signal and removes it to obtain purified voice data, then performs voiceprint recognition on the purified voice data, improving the accuracy of voiceprint verification. Through the GMM-VAD model, combining the local decision and the global decision, noise data and voice data are accurately distinguished, improving the purity of the clean speech signal and further increasing the accuracy of voiceprint verification. Mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector) based on GMM-UBM reduces the computational cost of the voiceprint feature extraction process and thus the cost of using voiceprint verification. Comparing against the pre-stored data of multiple people during voiceprint verification lowers the equal error rate of the verification and reduces the verification error introduced by model error.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used herein and in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "include" and "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.
The foregoing are merely preferred embodiments of the present application and are not intended to limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the present specification and accompanying drawings, whether applied directly or indirectly in other related technical fields, is likewise included in the patent protection scope of the present application.
Claims (10)
1. A method of voiceprint verification, characterized by comprising:
Inputting a voice signal for voiceprint verification into a VAD model and distinguishing the speech frames and noise frames in the voice signal;
Removing the noise frames to obtain purified voice data composed of the speech frames;
Extracting a first voiceprint feature corresponding to the purified voice data;
Judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
If satisfied, determining that the first voiceprint feature is identical to the pre-stored voiceprint feature, and otherwise not identical.
2. The method of voiceprint verification according to claim 1, wherein the VAD model comprises a Fourier transform and Gaussian mixture models GMM-NOISE and GMM-SPEECH, and the step of inputting the voice signal into the VAD model and distinguishing the speech frames and noise frames in the voice signal comprises:
Inputting the voice signal into the Fourier transform in the VAD model, converting the voice signal from time-domain form into frequency-domain form;
Inputting each frame of the frequency-domain voice signal into the GMM-NOISE and GMM-SPEECH models respectively for VAD judgement, so as to distinguish the speech frames and noise frames in the voice signal.
3. The method of voiceprint verification according to claim 2, wherein the step of inputting each frame of the frequency-domain voice signal into the GMM-NOISE and GMM-SPEECH models respectively for VAD judgement, so as to distinguish the speech frames and noise frames in the voice signal, comprises:
Inputting each frame of the frequency-domain voice signal into GMM-NOISE and GMM-SPEECH respectively, obtaining for each frame a noise-frame probability p_noise and a speech-frame probability p_speech;
Computing the local log-likelihood ratio according to LLR_local = log(p_speech / p_noise);
Judging whether the local log-likelihood ratio is higher than a local threshold;
If so, determining that the frame whose local log-likelihood ratio is higher than the local threshold is a speech frame.
4. The method of voiceprint verification according to claim 3, wherein after the step of judging whether the local log-likelihood ratio is higher than the local threshold, the method comprises:
If the local log-likelihood ratio is not higher than the local threshold, computing the global log-likelihood ratio over all L frames according to LLR_global = (1/L) Σ_k log(p_speech(k) / p_noise(k));
Judging whether the global log-likelihood ratio is higher than a global threshold;
If the global log-likelihood ratio is higher than the global threshold, determining that the corresponding frames are speech frames.
5. The method of voiceprint verification according to claim 1, wherein the step of extracting the first voiceprint feature corresponding to the purified voice data comprises:
Extracting the MFCC-type voiceprint feature corresponding to each speech frame in the purified voice data;
Constructing the voiceprint feature vector corresponding to each speech frame from its MFCC-type voiceprint feature;
Mapping each voiceprint feature vector to a low-dimensional voiceprint discriminant vector (i-vector), so as to obtain the first voiceprint feature corresponding to each speech frame in the purified voice data.
6. The method of voiceprint verification according to claim 5, wherein the step of judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition comprises:
Obtaining the pre-stored voiceprint features from the stored voiceprint feature data of multiple people, wherein the voiceprint feature data of the multiple people include the pre-stored voiceprint feature of the target person;
Computing the similarity value between each pre-stored voiceprint feature and the first voiceprint feature;
Sorting the similarity values in descending order;
Judging whether the similarity value corresponding to the target person's pre-stored voiceprint feature is among the top preset number of similarity values;
If so, determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, and otherwise that it does not.
7. The method of voiceprint verification according to claim 6, wherein the step of computing the similarity value between each pre-stored voiceprint feature and the first voiceprint feature comprises:
Computing the cosine distance between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula cos(x, y) = (x · y) / (‖x‖ ‖y‖), wherein x represents each pre-stored voiceprint discriminant vector and y represents the voiceprint discriminant vector of the first voiceprint feature;
Converting the cosine distance values into the similarity values, wherein the smallest cosine distance value corresponds to the largest similarity value.
8. A device of voiceprint verification, characterized by comprising:
A discriminating module, for inputting a voice signal for voiceprint verification into a VAD model and distinguishing the speech frames and noise frames in the voice signal;
A removal module, for removing the noise frames to obtain purified voice data composed of the speech frames;
An extraction module, for extracting a first voiceprint feature corresponding to the purified voice data;
A judgment module, for judging whether the similarity between the first voiceprint feature and a pre-stored voiceprint feature meets a preset condition;
A determination module, for determining, if the preset condition is met, that the first voiceprint feature is identical to the pre-stored voiceprint feature, and otherwise not identical.
9. A computer equipment, comprising a memory and a processor, the memory storing a computer program, wherein the processor realizes the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by a processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811184693.9A CN109378002B (en) | 2018-10-11 | 2018-10-11 | Voiceprint verification method, voiceprint verification device, computer equipment and storage medium |
PCT/CN2018/124401 WO2020073518A1 (en) | 2018-10-11 | 2018-12-27 | Voiceprint verification method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811184693.9A CN109378002B (en) | 2018-10-11 | 2018-10-11 | Voiceprint verification method, voiceprint verification device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109378002A true CN109378002A (en) | 2019-02-22 |
CN109378002B CN109378002B (en) | 2024-05-07 |
Family
ID=65403684
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811184693.9A Active CN109378002B (en) | 2018-10-11 | 2018-10-11 | Voiceprint verification method, voiceprint verification device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109378002B (en) |
WO (1) | WO2020073518A1 (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102479511A (en) * | 2010-11-23 | 2012-05-30 | 盛乐信息技术(上海)有限公司 | Large-scale voiceprint authentication method and system |
CN105575406A (en) * | 2016-01-07 | 2016-05-11 | 深圳市音加密科技有限公司 | Noise robustness detection method based on likelihood ratio test |
CN107068154A (en) * | 2017-03-13 | 2017-08-18 | 平安科技(深圳)有限公司 | The method and system of authentication based on Application on Voiceprint Recognition |
CN108109612A (en) * | 2017-12-07 | 2018-06-01 | 苏州大学 | A kind of speech recognition sorting technique based on self-adaptive reduced-dimensions |
CN108154371A (en) * | 2018-01-12 | 2018-06-12 | 平安科技(深圳)有限公司 | Electronic device, the method for authentication and storage medium |
CN108172230A (en) * | 2018-01-03 | 2018-06-15 | 平安科技(深圳)有限公司 | Voiceprint registration method, terminal installation and storage medium based on Application on Voiceprint Recognition model |
CN108428456A (en) * | 2018-03-29 | 2018-08-21 | 浙江凯池电子科技有限公司 | Voice de-noising algorithm |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6349278B1 (en) * | 1999-08-04 | 2002-02-19 | Ericsson Inc. | Soft decision signal estimation |
CN103236260B (en) * | 2013-03-29 | 2015-08-12 | 京东方科技集团股份有限公司 | Speech recognition system |
- 2018-10-11: CN application CN201811184693.9A filed (granted as CN109378002B, status Active)
- 2018-12-27: PCT application PCT/CN2018/124401 filed (published as WO2020073518A1)
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110265012A (en) * | 2019-06-19 | 2019-09-20 | 泉州师范学院 | It can interactive intelligence voice home control device and control method based on open source hardware |
CN110675878A (en) * | 2019-09-23 | 2020-01-10 | 金瓜子科技发展(北京)有限公司 | Method and device for identifying vehicle and merchant, storage medium and electronic equipment |
WO2021098153A1 (en) * | 2019-11-18 | 2021-05-27 | 锐迪科微电子科技(上海)有限公司 | Method, system, and electronic apparatus for detecting change of target user, and storage medium |
WO2021127990A1 (en) * | 2019-12-24 | 2021-07-01 | 广州国音智能科技有限公司 | Voiceprint recognition method based on voice noise reduction and related apparatus |
CN111274434A (en) * | 2020-01-16 | 2020-06-12 | 上海携程国际旅行社有限公司 | Audio corpus automatic labeling method, system, medium and electronic equipment |
JP2022536190A (en) * | 2020-04-28 | 2022-08-12 | 平安科技(深▲せん▼)有限公司 | Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium |
JP7184236B2 (en) | 2020-04-28 | 2022-12-06 | 平安科技(深▲せん▼)有限公司 | Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium |
WO2022068675A1 (en) * | 2020-09-29 | 2022-04-07 | 华为技术有限公司 | Speaker speech extraction method and apparatus, storage medium, and electronic device |
CN112331217A (en) * | 2020-11-02 | 2021-02-05 | 泰康保险集团股份有限公司 | Voiceprint recognition method and device, storage medium and electronic equipment |
CN112331217B (en) * | 2020-11-02 | 2023-09-12 | 泰康保险集团股份有限公司 | Voiceprint recognition method and device, storage medium and electronic equipment |
CN112735433A (en) * | 2020-12-29 | 2021-04-30 | 平安普惠企业管理有限公司 | Identity verification method, device, equipment and storage medium |
CN113488059A (en) * | 2021-08-13 | 2021-10-08 | 广州市迪声音响有限公司 | Voiceprint recognition method and system |
Also Published As
Publication number | Publication date |
---|---|
WO2020073518A1 (en) | 2020-04-16 |
CN109378002B (en) | 2024-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109378002A (en) | Method, apparatus, computer equipment and the storage medium of voice print verification | |
WO2020177380A1 (en) | Voiceprint detection method, apparatus and device based on short text, and storage medium | |
CN107610707A (en) | A kind of method for recognizing sound-groove and device | |
CN109346086A (en) | Method for recognizing sound-groove, device, computer equipment and computer readable storage medium | |
CN104978507B (en) | A kind of Intelligent controller for logging evaluation expert system identity identifying method based on Application on Voiceprint Recognition | |
CN103971690A (en) | Voiceprint recognition method and device | |
CN110443692A (en) | Enterprise's credit authorization method, apparatus, equipment and computer readable storage medium | |
JP2008509432A (en) | Method and system for verifying and enabling user access based on voice parameters | |
US20070198262A1 (en) | Topological voiceprints for speaker identification | |
CN109473105A (en) | The voice print verification method, apparatus unrelated with text and computer equipment | |
CN107346568A (en) | The authentication method and device of a kind of gate control system | |
CN110164453A (en) | A kind of method for recognizing sound-groove, terminal, server and the storage medium of multi-model fusion | |
CN108154371A (en) | Electronic device, the method for authentication and storage medium | |
CN111081223B (en) | Voice recognition method, device, equipment and storage medium | |
Karthikeyan | Adaptive boosted random forest-support vector machine based classification scheme for speaker identification | |
Revathi et al. | Person authentication using speech as a biometric against play back attacks | |
Chaudhari et al. | Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction | |
Mubeen et al. | Detection of impostor and tampered segments in audio by using an intelligent system | |
CN115102789A (en) | Anti-communication network fraud studying, judging, early-warning and intercepting comprehensive platform | |
CN110188338A (en) | The relevant method for identifying speaker of text and equipment | |
Nagakrishnan et al. | Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models | |
Hossan et al. | Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization | |
CN112992155A (en) | Far-field voice speaker recognition method and device based on residual error neural network | |
Gupta et al. | Text dependent voice based biometric authentication system using spectrum analysis and image acquisition | |
TWI778234B (en) | Speaker verification system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||