CN109378002B - Voiceprint verification method, voiceprint verification device, computer equipment and storage medium


Info

Publication number
CN109378002B
CN109378002B (application CN201811184693.9A)
Authority
CN
China
Prior art keywords
voiceprint
voice
frame
noise
data
Prior art date
Legal status
Active
Application number
CN201811184693.9A
Other languages
Chinese (zh)
Other versions
CN109378002A (en)
Inventor
杨翘楚
王健宗
肖京
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201811184693.9A
Priority to PCT/CN2018/124401
Publication of CN109378002A
Application granted
Publication of CN109378002B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voiceprint verification method, a voiceprint verification device, computer equipment and a storage medium. The voiceprint verification method comprises the following steps: inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing the voice frames from the noise frames in the voice signal; removing the noise frames to obtain purified voice data composed of the voice frames; extracting first voiceprint features corresponding to the purified voice data; judging whether the similarity between the first voiceprint features and the pre-stored voiceprint features meets a preset condition; if yes, judging that the first voiceprint features are the same as the pre-stored voiceprint features, and otherwise that they are different. By recognizing the noise data in the voice signal, removing it to obtain purified voice data, and then carrying out voiceprint recognition on the purified voice data, the accuracy of voiceprint verification is improved.

Description

Voiceprint verification method, voiceprint verification device, computer equipment and storage medium
Technical Field
The present application relates to the field of voiceprint verification, and in particular, to a voiceprint verification method, apparatus, computer device, and storage medium.
Background
At present, the business scope of many large financial companies covers multiple categories such as insurance, banking and investment. Each category generally requires communication with clients as well as anti-fraud screening, so client identity authentication and anti-fraud identification have become important components of business security. In the client authentication link, voiceprint authentication has been adopted by many companies for its real-time performance and ease of use. In practical applications, the collected voice data is affected by the environment of the speaker during the identity registration or identity verification link and often contains background noise that does not come from the speaker; this has become one of the main factors affecting the success rate of voiceprint verification.
Disclosure of Invention
The application mainly aims to provide a voiceprint verification method that solves the technical problem that noise in existing voice data adversely affects the voiceprint verification effect.
The application provides a voiceprint verification method, which comprises the following steps:
Inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing a voice frame and a noise frame in the voice signal;
removing the noise frames to obtain purified voice data formed by the voice frames;
extracting first voiceprint features corresponding to the purified voice data;
Judging whether the similarity of the first voiceprint features and the pre-stored voiceprint features meets a preset condition or not;
If yes, judging that the first voiceprint features are the same as the pre-stored voiceprint features; otherwise, judging that they are different.
Preferably, the VAD model includes a Fourier transform and the Gaussian mixture models GMM-NOISE and GMM-SPEECH, and the step of inputting the voice signal into the VAD model to distinguish the voice frames from the noise frames in the voice signal includes:
inputting the voice signal into the Fourier transform in the VAD model, converting the voice signal from time-domain signal form to frequency-domain signal form;
inputting each frame of data of the voice signal in frequency-domain form into the GMM-NOISE and the GMM-SPEECH respectively for VAD decisions, to distinguish the voice frames from the noise frames in the voice signal.
Preferably, the step of inputting each frame of data of the voice signal in frequency-domain form into the GMM-NOISE and the GMM-SPEECH respectively to make VAD decisions includes:
inputting each frame of data of the voice signal in frequency-domain form into the GMM-NOISE and the GMM-SPEECH to obtain the noise frame probability $p_{\text{noise},k}$ and the voice frame probability $p_{\text{speech},k}$ of each frame of data in each frequency band $k$;
calculating a local log-likelihood ratio according to $\Lambda_k = \log\left(p_{\text{speech},k} / p_{\text{noise},k}\right)$;
Judging whether the local log likelihood ratio is higher than a local threshold value or not;
If yes, judging that the frame data with the local log likelihood ratio higher than a local threshold value is a voice frame.
Preferably, after the step of judging whether the local log-likelihood ratio is higher than a local threshold value, the method includes:
if the local log-likelihood ratio is not higher than the local threshold value, calculating a global log-likelihood ratio according to $\Lambda_{\text{global}} = \sum_{k=1}^{n} w_k \log\left(p_{\text{speech},k} / p_{\text{noise},k}\right)$, where $w_k$ is the weight of frequency band $k$;
judging whether the global log likelihood ratio is higher than a global threshold value or not;
if the global log-likelihood ratio is higher than the global threshold value, judging that the frame data with the global log-likelihood ratio higher than the global threshold value is a voice frame.
Preferably, the step of extracting the first voiceprint feature corresponding to the purified voice data includes:
Extracting MFCC type voiceprint features corresponding to each voice frame in the purified voice data;
Constructing voiceprint feature vectors corresponding to the voice frames respectively according to the voiceprint features of the MFCC types;
and mapping each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector respectively to obtain first voiceprint features corresponding to each voice frame in the purified voice data.
Preferably, the step of determining whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets a preset condition includes:
Respectively acquiring corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of a plurality of persons, wherein the voiceprint feature data of the plurality of persons comprises pre-stored voiceprint features of a target person;
respectively calculating similarity values between each pre-stored voiceprint feature and the first voiceprint feature;
Sorting the similarity values in descending order;
Judging whether the top preset number of the sorted similarity values include the similarity value corresponding to the pre-stored voiceprint feature of the target person;
if yes, judging that the similarity between the first voiceprint features and the pre-stored voiceprint features meets the preset condition, and otherwise that it does not.
Preferably, the step of calculating similarity values between each of the pre-stored voiceprint features and the first voiceprint feature, respectively, includes:
Calculating, through the cosine distance formula $d(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$, the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature;
And converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
The application also provides a voiceprint verification device, which comprises:
The distinguishing module is used for inputting the voice signal to be voiceprint verified into the VAD model and distinguishing the voice frame and the noise frame in the voice signal;
the removing module is used for removing the noise frames to obtain purified voice data formed by the voice frames;
the extraction module is used for extracting first voiceprint features corresponding to the purified voice data;
The judging module is used for judging whether the similarity between the first voiceprint characteristics and the pre-stored voiceprint characteristics meets a preset condition or not;
And the judging module is used for judging that the first voiceprint features are the same as the pre-stored voiceprint features if the preset condition is met, and otherwise that they are different.
The application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
The application also provides a computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the above method.
With the voiceprint verification method, device, computer equipment and storage medium of the application, the noise data in the voice signal is recognized and removed to obtain purified voice data, and voiceprint recognition is then carried out on the purified voice data, which improves the accuracy of voiceprint verification. Through the GMM-VAD model and the combination of local and global decisions, the application accurately distinguishes noise data from voice data, improving the degree to which the voice signal is purified and further improving the accuracy of voiceprint verification. The application maps each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector based on the GMM-UBM, reducing the calculation cost in the voiceprint feature extraction process and the use cost of voiceprint verification. By comparing against the pre-stored data of multiple persons during voiceprint verification, the application reduces the error rate of voiceprint verification and the errors in voiceprint verification accuracy caused by model error.
Drawings
FIG. 1 is a flow chart of a method for voiceprint authentication according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a voiceprint authentication apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, a method for voiceprint authentication according to an embodiment of the present application includes:
S1: The voice signal to be voiceprint verified is input into the VAD model, and the voice frames and the noise frames in the voice signal are distinguished.
The VAD model of this embodiment, also called a voice endpoint detector, is used to detect whether human voice data exists in a noisy environment. The VAD model scores each frame of the input voice signal, i.e., computes the probability that the frame is a voice frame or a noise frame; when the voice frame probability value is larger than a preset decision threshold, the frame is judged to be a voice frame, and otherwise it is judged to be a noise frame. The VAD model distinguishes the voice frames from the noise frames according to these decision results so that the noise frames in the voice signal can be removed. The decision threshold in this embodiment adopts the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of the WebRTC technology; this improves the distinguishing effect and accuracy and reduces the model training workload of the VAD model.
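A minimal sketch of this per-frame decision follows; the `speech_prob` scorer and the threshold value are illustrative assumptions, not the patent's actual implementation.

```python
# Sketch of step S1: score each frame with a speech probability and compare
# it against a preset decision threshold. speech_prob is an assumed callable.
def label_frames(frames, speech_prob, threshold=0.5):
    """Return True for frames judged to be voice frames, False for noise."""
    return [speech_prob(f) > threshold for f in frames]
```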
S2: and removing the noise frames to obtain purified voice data formed by the voice frames.
According to the distinguishing result, this embodiment cuts out the data marked as noise frames and arranges the remaining voice frames continuously in their original time order to form the purified voice data composed of voice frames. In other embodiments of the present application, the data marked as voice frames may instead be screened out, extracted and stored, and the extracted voice frames arranged continuously in their original time order to form the purified voice data. By removing the background noise data of the environment of the identity registration or identity verification link that does not come from the speaker, the influence of noise data in the voice signal on the voiceprint verification effect is reduced and the voiceprint verification success rate is improved.
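A hedged sketch of step S2, assuming per-frame boolean VAD labels from the previous step; the names are illustrative.

```python
# Sketch of step S2: drop the noise frames and concatenate the remaining
# voice frames in their original time order.
import numpy as np

def purify(frames, is_speech):
    """frames: list of 1-D sample arrays; is_speech: per-frame booleans."""
    kept = [f for f, keep in zip(frames, is_speech) if keep]
    return np.concatenate(kept) if kept else np.empty(0)
```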
S3: and extracting a first voiceprint feature corresponding to the purified voice data.
According to the embodiment, only the first voiceprint features corresponding to the purified voice data are analyzed, so that the calculated amount in voiceprint verification is reduced, and the effectiveness, pertinence and timeliness of voiceprint verification are improved.
S4: and judging whether the similarity between the first voiceprint features and the pre-stored voiceprint features meets a preset condition or not.
The preset conditions in this embodiment include a specified preset threshold range, or a specified ordering, etc., and may be set in a user-defined manner according to a specific application scenario, so as to more widely meet the personalized use requirement.
S5: if yes, judging that the first voiceprint features are the same as the pre-stored voiceprint features, otherwise, judging that the first voiceprint features are not the same.
When the first voiceprint feature is judged to be the same as the pre-stored voiceprint feature, this embodiment feeds back a verification-passed result to the client, and otherwise feeds back a verification-failed result, so that the client can perform further application operations according to the feedback. For example, a smart door may be opened after verification passes; for another example, after verification fails a specified number of times, a security system may be controlled to lock the screen to prevent criminals from further attacking an electronic banking system.
Further, the VAD model of this embodiment includes a Fourier transform and the Gaussian mixture models GMM-NOISE and GMM-SPEECH, and step S1 includes:
S100: The voice signal is input into the Fourier transform in the VAD model, which converts the voice signal from time-domain signal form to frequency-domain signal form.
In the embodiment, the time domain signal form is converted into the frequency domain signal form in a one-to-one correspondence manner through Fourier transformation in the VAD model so as to analyze the attribute of the voice signal of each frame and facilitate distinguishing the voice frame from the noise frame.
S101: each frame of data of the SPEECH signal in the form of a frequency domain signal is input into the GMM-NOISE and GMM-SPEECH, respectively, for VAD decisions to distinguish between SPEECH frames and NOISE frames in the SPEECH signal.
This embodiment preferably adopts a VAD model based on the Gaussian mixture model GMM. For each input frame of the voice signal in frequency-domain form, it extracts the energy over 6 frequency bands as the feature vector of that frame, and models the Gaussian mixture distributions of noise and speech on the 6 frequency bands separately, each frequency band having a noise model GMM-NOISE containing two Gaussian components and a speech model GMM-SPEECH containing two Gaussian components. The 6 frequency bands are set according to the WebRTC technology based on the spectral differences between noise and voice, in order to improve analysis accuracy and match the WebRTC implementation; the number of analysis frequency bands in other embodiments of the present application is not necessarily 6 and may be set according to actual requirements. In addition, since the mains power standard in China is 220 V at 50 Hz, 50 Hz interference from the power supply is mixed into the microphone that collects the voice signals, and the collected interference signal and physical vibration would affect the result; this embodiment therefore preferably collects voice signals above 80 Hz to reduce the mains interference, and since the highest frequency reached by the voice is about 4 kHz, this embodiment preferably divides the frequency bands within the range of 80 Hz to 4 kHz. The VAD decisions of this embodiment include a local decision (Local Decision) and a global decision (Global Decision).
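A sketch of the per-band energy features described above; the exact band edges follow the WebRTC convention and are an assumption, since the embodiment only specifies 6 bands within 80 Hz to 4 kHz.

```python
# Sketch: log energy in 6 sub-bands of a frame's power spectrum as the
# frame's feature vector. Band edges are assumed (WebRTC-style).
import numpy as np

BAND_EDGES_HZ = [80, 250, 500, 1000, 2000, 3000, 4000]  # 6 assumed bands

def band_features(frame, sample_rate):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    feats = []
    for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(np.log(spectrum[mask].sum() + 1e-10))  # log band energy
    return np.array(feats)                               # shape (6,)
```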
Further, step S101 of the present embodiment includes:
S1010: inputting each frame of data of voice signal in frequency domain signal form into GMM-NOISE and GMM-SPEECH to obtain NOISE frame probability of each frame of data And speech frame probability/>
This embodiment inputs each frame of the voice signal to be analyzed as a voice frame or a noise frame into the GMM-NOISE and the GMM-SPEECH respectively, and obtains from them the noise frame probability value and the voice frame probability value of that frame, so that whether the frame is a noise frame or a voice frame can be determined by comparing the magnitudes of the two probability values.
S1011: according toA local log-likelihood ratio is calculated.
This embodiment preferably uses the VAD model based on the Gaussian mixture model GMM, which extracts the energy over 6 frequency bands of each input frame of the voice signal in frequency-domain form as the feature vector of that frame; therefore n takes the value 6 in this embodiment, and 6 local decisions are performed when each frame is judged, i.e., one local decision per frequency band. As long as the frame is judged to be a voice frame in any one of them, the frame is retained.
S1012: and judging whether the local log likelihood ratio is higher than a local threshold value.
This embodiment distinguishes voice frames from noise frames through the local decision, which is carried out once on each frequency band, 6 times in total. The likelihood ratio is an index reflecting authenticity and is a composite index reflecting both sensitivity and specificity, which improves the accuracy of probability estimation; by checking whether the local log-likelihood ratio is higher than the local threshold value in addition to the voice frame probability value being higher than the noise frame probability value, the accuracy of judging a frame to be a voice frame is further ensured.
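A sketch of the local decision under these definitions; the threshold value is an assumption (the embodiment adopts WebRTC's defaults).

```python
# Sketch: per-band local log-likelihood ratio log(p_speech / p_noise);
# the frame is accepted as voice as soon as any band passes the threshold.
import numpy as np

LOCAL_THRESHOLD = 0.9  # assumed value

def local_decision(p_speech, p_noise):
    """p_speech, p_noise: per-band probabilities, arrays of shape (6,)."""
    local_llr = np.log(p_speech + 1e-10) - np.log(p_noise + 1e-10)
    return bool(np.any(local_llr > LOCAL_THRESHOLD))
```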
S1013: if yes, the frame data with the local log likelihood ratio higher than the local threshold value is judged to be the voice frame.
The GMM of the present embodiment has adaptive updating capability, and after each frame of speech signal is determined as a speech frame or a noise frame, the parameters of the corresponding model are updated according to the feature value of the frame. For example, if the frame is determined to be a SPEECH frame, the expected value, standard deviation and gaussian component weight value of the GMM-SPEECH are updated once according to the feature value of the frame, and after more SPEECH frames are input into the GMM-SPEECH, the GMM-SPEECH is more and more adapted to the voiceprint feature of the speaker of the SPEECH signal, and the analysis conclusion given is more accurate.
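A minimal sketch of such an adaptive update; the exponential-smoothing form and the rate are assumptions, since the embodiment only states that the expected value, standard deviation and component weights are updated from the frame's feature values.

```python
# Sketch: after a frame is judged to be speech, nudge a GMM-SPEECH mean
# toward the frame's feature vector.
import numpy as np

def update_mean(mean, feature, rate=0.01):
    return (1.0 - rate) * np.asarray(mean) + rate * np.asarray(feature)
```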
Further, after step S1012 of another embodiment of the present application, the method includes:
S1014: If the local log-likelihood ratio is not higher than the local threshold value, then a global log-likelihood ratio is calculated according to $\Lambda_{\text{global}} = \sum_{k=1}^{n} w_k \log\left(p_{\text{speech},k} / p_{\text{noise},k}\right)$, where $w_k$ is the weight of frequency band $k$.
In this embodiment, the local decision is performed first and the global decision afterwards; the global decision computes a weighted sum over the frequency bands on the basis of the local decision results, so as to improve the accuracy of distinguishing voice frames from noise frames.
S1015: and judging whether the global log likelihood ratio is higher than a global threshold value.
In the global judgment of the embodiment, the global log likelihood ratio is compared with the global threshold value, so that the accuracy of screening the voice frames is further improved.
S1016: if the global log likelihood ratio is higher than the global threshold value, judging that the frame data with the global log likelihood ratio higher than the global threshold value is a voice frame.
In this embodiment, the global decision is skipped when the local decision already finds voice, which improves the efficiency of voiceprint verification while still recognizing as many voice frames as possible and avoiding voice distortion. Other embodiments of the present application may also perform the global decision even after the local decision finds voice, to further verify and confirm the existence of voice and improve the accuracy of distinguishing voice frames from noise frames.
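A sketch of the global decision as a weighted sum of per-band log-likelihood ratios; the uniform weights and the threshold value are illustrative assumptions.

```python
# Sketch: when no single band passes the local threshold, combine the
# per-band log-likelihood ratios and compare against a global threshold.
import numpy as np

BAND_WEIGHTS = np.full(6, 1.0 / 6.0)  # assumed uniform weights w_k
GLOBAL_THRESHOLD = 3.0                # assumed value

def global_decision(p_speech, p_noise):
    llr = np.log(p_speech + 1e-10) - np.log(p_noise + 1e-10)
    return float(BAND_WEIGHTS @ llr) > GLOBAL_THRESHOLD
```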
Further, step S3 of the present embodiment includes:
S30: and extracting MFCC type voiceprint features corresponding to each voice frame in the purified voice data.
The procedure for extracting MFCC (Mel Frequency Cepstrum Coefficient) type voiceprint features in this embodiment is as follows: the continuous analog voice signal of the purified voice data is sampled at a certain sampling period, converted into a discrete signal, and quantized into a digital signal according to a certain coding rule; pre-emphasis is then applied to compensate the high-frequency components, which are often suppressed due to the physiological characteristics of the human vocal tract; because of the short-time stationarity of the voice signal, the signal is then framed (generally 10 to 30 milliseconds per frame) for spectrum analysis and features are extracted frame by frame; windowing is then performed to reduce the discontinuity of the signal at the start and end of each frame, a Hamming window being adopted; the frame signal is then converted from the time domain to the frequency domain by a DFT, and the signal is mapped from the linear spectrum domain to the mel spectrum domain using the formula $\text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$; the converted frame signal is input into a group of mel triangular filter banks, and the logarithmic energy of the signal output by each band's filter is calculated to obtain a logarithmic energy sequence; finally, a discrete cosine transform (DCT) is performed on the logarithmic energy sequence to obtain the MFCC type voiceprint feature of the frame voice signal.
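A compact sketch of this MFCC pipeline (pre-emphasis, framing, Hamming window, DFT, mel filter bank, log energy, DCT); the parameter values are common defaults, not values specified by the embodiment.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, frame_ms=25, n_filters=26, n_coeffs=13):
    # pre-emphasis to compensate suppressed high-frequency components
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    flen = int(sr * frame_ms / 1000)          # samples per frame
    hop = flen // 2
    frames = np.stack([signal[i:i + flen]
                       for i in range(0, len(signal) - flen, hop)])
    frames = frames * np.hamming(flen)        # windowing
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # DFT power spectrum
    # linear-to-mel mapping: mel(f) = 2595 * log10(1 + f / 700)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((flen + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))      # mel triangular filters
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)       # log filter-bank energies
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```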
S31: and constructing voiceprint feature vectors corresponding to the voice frames respectively according to the voiceprint features of the MFCC types.
The MFCC type voiceprint features have nonlinear features, so that analysis results on each frequency band are closer to the features of real voice sent by a human body, the voiceprint features are extracted more accurately, and the voiceprint verification effect is improved.
S32: and mapping each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector respectively to obtain first voiceprint features corresponding to each voice frame in the purified voice data.
This embodiment maps each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector based on the GMM-UBM (Gaussian Mixture Model-Universal Background Model), which reduces the calculation cost in the voiceprint feature extraction process and the use cost of voiceprint verification. The training process of the GMM-UBM of this embodiment is as follows: B1: acquire a preset number (e.g., 100,000) of voice data samples, each corresponding to a voiceprint recognition vector and collected from the voices of different people in different environments; such voice data samples are used to train a universal background model (GMM-UBM) capable of characterizing general voice characteristics. B2: process each voice data sample to extract the preset type of voiceprint features corresponding to it, and construct the voiceprint feature vector of each sample based on those features. B3: divide all constructed voiceprint feature vectors of the preset type into a training set with a first percentage and a verification set with a second percentage, wherein the first and second percentages sum to at most 100%. B4: train the GMM-UBM (hereinafter the second model) with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model with the verification set. B5: if the accuracy is greater than a preset accuracy (e.g., 98.5%), finish model training; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 on the enlarged sample set.
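A hedged sketch of training such a universal background model as a Gaussian mixture over pooled voiceprint feature vectors, using scikit-learn; the component count and settings are assumptions.

```python
# Sketch: fit a diagonal-covariance GMM on MFCC-type feature vectors pooled
# from many speakers and environments to serve as the UBM.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(feature_vectors, n_components=512):
    """feature_vectors: array of shape (n_frames_total, feat_dim)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=200)
    ubm.fit(feature_vectors)
    return ubm
```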
The voiceprint identification vector of this embodiment is expressed as an I-vector. The I-vector is a vector whose dimension is much lower than that of the Gaussian supervector space, which reduces the calculation cost; extracting the low-dimensional I-vector amounts to relating the low-dimensional vector $\omega_r$ to the higher-dimensional Gaussian space through a transformation matrix $T$ by the formula $m_r = \mu + T\,\omega_r$. The extraction of the I-vector comprises the following steps: after the training voice data from a certain target speaker is processed, the extracted preset type of voiceprint feature vectors (for example, MFCC) are input into the GMM-UBM, and a Gaussian supervector representing the probability distribution of the voice data over the Gaussian components is obtained; the lower-dimensional voiceprint identification vector I-vector corresponding to the section of voice can then be calculated from $m_r = \mu + T\,\omega_r$, wherein $m_r$ is the Gaussian supervector representing the section of voice, $\mu$ is the mean supervector of the second model, $T$ is the conversion matrix mapping the low-dimensional I-vector $\omega_r$ to the high-dimensional Gaussian space, and $T$ is trained with an EM algorithm.
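A hedged sketch of recovering the low-dimensional vector $\omega_r$ from the supervector model $m_r = \mu + T\,\omega_r$; solving by ordinary least squares is a simplification of the EM-based extraction the embodiment refers to, which uses Baum-Welch statistics the patent does not detail.

```python
# Sketch: invert the linear supervector model m_r = mu + T @ w_r by least
# squares to obtain an i-vector-like low-dimensional representation.
import numpy as np

def extract_ivector(m_r, mu, T):
    """m_r, mu: supervectors of shape (D,); T: matrix of shape (D, d), d << D."""
    w_r, *_ = np.linalg.lstsq(T, m_r - mu, rcond=None)
    return w_r  # the low-dimensional voiceprint identification vector
```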
Further, step S4 of the present embodiment includes:
s40: respectively acquiring corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of a plurality of persons, wherein the voiceprint feature data of the plurality of persons comprises pre-stored voiceprint features of a target person.
This embodiment pre-stores the voiceprint feature data of multiple persons, including the target person, and judges against all of them whether the voiceprint feature of the currently collected voice signal is identical to that of the target person, which improves the judgment accuracy.
S41: and respectively calculating similarity values between each pre-stored voiceprint feature and the first voiceprint feature.
The similarity value in this embodiment characterizes how similar a pre-stored voiceprint feature and the first voiceprint feature are; the greater the similarity value, the more similar the two. In this embodiment the similarity value is obtained from a feature distance value between the pre-stored voiceprint feature and the first voiceprint feature, where the feature distance value may be a cosine distance value, a Euclidean distance value, or the like.
S42: and sorting the similarity values in order from big to small.
This embodiment ranks the similarity values between each pre-stored voiceprint feature and the first voiceprint feature from large to small, so as to analyze more accurately how similar the first voiceprint feature is to each pre-stored voiceprint feature and thus verify the first voiceprint feature more accurately.
S43: judging whether the similarity values of the preset number of the prior ordered similarity values comprise the similarity values corresponding to the pre-stored voiceprint features of the target person.
In this embodiment, when the top preset number of sorted similarity values include the similarity value corresponding to the pre-stored voiceprint feature of the target person, the first voiceprint feature is judged to be identical to the pre-stored voiceprint feature of the target person, which reduces the recognition error rate caused by model error; here the error rate covers both the frequency of verification failing when it should pass and the frequency of verification passing when it should not. The preset number in this embodiment may be 1, 2 or 3, and can be set according to the use requirement.
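A sketch of this ranking check; the data structures are illustrative assumptions.

```python
# Sketch: sort similarity values in descending order and test whether the
# target person appears among the top-N entries.
def target_in_top_n(similarities, target_id, n=3):
    """similarities: dict mapping person_id -> similarity to the first
    voiceprint feature."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:n]
```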
S44: if yes, judging that the similarity of the first voiceprint features and the pre-stored voiceprint features meets the preset condition, otherwise, not meeting the preset condition.
Other embodiments of the present application realize effective voiceprint verification by setting a distance threshold between the first voiceprint feature and the pre-stored voiceprint feature of the target user. For example, with a preset threshold of 0.6, if the calculated cosine distance between the first voiceprint feature and the pre-stored voiceprint feature of the target user is less than or equal to the preset threshold, the first voiceprint feature is determined to be the same as the pre-stored voiceprint feature of the target user and verification passes; if the cosine distance is greater than the preset threshold, the first voiceprint feature is determined to be different from the pre-stored voiceprint feature of the target user and verification fails.
Further, step S41 of the present embodiment includes:
S410: Calculating, through the cosine distance formula $d(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$, the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature.
This embodiment uses the cosine distance to characterize the similarity between each pre-stored voiceprint feature and the first voiceprint feature; the smaller the cosine distance value, the closer or the more nearly identical the two voiceprint features are.
S411: and converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
The present embodiment may convert the cosine distance value into the similarity value by inverting the cosine distance value according to an inverse proportion formula that carries a specified inverse proportion coefficient.
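A sketch of computing the cosine distance and converting it into a similarity value by an inverse-proportion formula; the coefficient and the epsilon guard are assumptions.

```python
# Sketch: cosine distance between two voiceprint identification vectors and
# an inverse-proportion conversion so that the smallest distance yields the
# largest similarity.
import numpy as np

def cosine_distance(x, y):
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def similarity(d, c=1.0, eps=1e-6):
    return c / (d + eps)  # smallest distance -> largest similarity
```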
In this embodiment, the noise data in the voice signal is recognized and removed to obtain purified voice data, and voiceprint recognition is then carried out on the purified voice data, which improves the voiceprint verification accuracy. Through the GMM-VAD model and the combination of local and global decisions, this embodiment accurately distinguishes noise data from voice data, improving the degree to which the voice signal is purified and further improving the accuracy of voiceprint verification. This embodiment maps each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector based on the GMM-UBM, reducing the calculation cost in the voiceprint feature extraction process and the use cost of voiceprint verification. By comparing against the pre-stored data of multiple persons during voiceprint verification, the error rate of voiceprint verification is reduced, as are the errors in voiceprint verification accuracy caused by model error.
Referring to fig. 2, an apparatus for voiceprint authentication according to an embodiment of the present application includes:
The distinguishing module 1 is used for inputting the voice signal to be voiceprint verified into the VAD model and distinguishing the voice frame and the noise frame in the voice signal.
The VAD model of this embodiment, also called a voice endpoint detector, is used to detect whether human voice data exists in a noisy environment. The VAD model scores each frame of the input voice signal, i.e., computes the probability that the frame is a voice frame or a noise frame; when the voice frame probability value is larger than a preset decision threshold, the frame is judged to be a voice frame, and otherwise it is judged to be a noise frame. The VAD model distinguishes the voice frames from the noise frames according to these decision results so that the noise frames in the voice signal can be removed. The decision threshold in this embodiment adopts the default decision threshold in the WebRTC source code, which was obtained by analyzing a large amount of data during the development of the WebRTC technology; this improves the distinguishing effect and accuracy and reduces the model training workload of the VAD model.
And the removing module 2 is used for removing the noise frames to obtain purified voice data formed by the voice frames.
According to the distinguishing result, this embodiment cuts out the data marked as noise frames and arranges the remaining voice frames continuously in their original time order to form the purified voice data composed of voice frames. In other embodiments of the present application, the data marked as voice frames may instead be screened out, extracted and stored, and the extracted voice frames arranged continuously in their original time order to form the purified voice data. By removing the background noise data of the environment of the identity registration or identity verification link that does not come from the speaker, the influence of noise data in the voice signal on the voiceprint verification effect is reduced and the voiceprint verification success rate is improved.
And the extraction module 3 is used for extracting the first voiceprint features corresponding to the purified voice data.
According to the embodiment, only the first voiceprint features corresponding to the purified voice data are analyzed, so that the calculated amount in voiceprint verification is reduced, and the effectiveness, pertinence and timeliness of voiceprint verification are improved.
And the judging module 4 is used for judging whether the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets a preset condition.
The preset conditions in this embodiment include a specified preset threshold range, or a specified ordering, etc., and may be set in a user-defined manner according to a specific application scenario, so as to more widely meet the personalized use requirement.
And the judging module 5 is used for judging that the first voiceprint features are the same as the pre-stored voiceprint features if the preset condition is met, and otherwise that they are different.
When the first voiceprint feature is judged to be the same as the pre-stored voiceprint feature, this embodiment feeds back a verification-passed result to the client, and otherwise feeds back a verification-failed result, so that the client can perform further application operations according to the feedback. For example, a smart door may be opened after verification passes; for another example, after verification fails a specified number of times, a security system may be controlled to lock the screen to prevent criminals from further attacking an electronic banking system.
Further, the VAD model of this embodiment includes a Fourier transform and the Gaussian mixture models GMM-NOISE and GMM-SPEECH, and the distinguishing module 1 includes:
and the conversion unit is used for inputting the voice signal into Fourier transform in the VAD model and converting the voice signal from a time domain signal form to a frequency domain signal form.
In the embodiment, the time domain signal form is converted into the frequency domain signal form in a one-to-one correspondence manner through Fourier transformation in the VAD model so as to analyze the attribute of the voice signal of each frame and facilitate distinguishing the voice frame from the noise frame.
And the distinguishing unit is used for respectively inputting each frame of data of the voice signal in frequency-domain form into the GMM-NOISE and the GMM-SPEECH for VAD decisions, so as to distinguish the voice frames from the noise frames in the voice signal.
This embodiment preferably adopts a VAD model based on the Gaussian mixture model GMM. For each input frame of the voice signal in frequency-domain form, it extracts the energy over 6 frequency bands as the feature vector of that frame, and models the Gaussian mixture distributions of noise and speech on the 6 frequency bands separately, each frequency band having a noise model GMM-NOISE containing two Gaussian components and a speech model GMM-SPEECH containing two Gaussian components. The 6 frequency bands are set according to the WebRTC technology based on the spectral differences between noise and voice, in order to improve analysis accuracy and match the WebRTC implementation; the number of analysis frequency bands in other embodiments of the present application is not necessarily 6 and may be set according to actual requirements. In addition, since the mains power standard in China is 220 V at 50 Hz, 50 Hz interference from the power supply is mixed into the microphone that collects the voice signals, and the collected interference signal and physical vibration would affect the result; this embodiment therefore preferably collects voice signals above 80 Hz to reduce the mains interference, and since the highest frequency reached by the voice is about 4 kHz, this embodiment preferably divides the frequency bands within the range of 80 Hz to 4 kHz. The VAD decisions of this embodiment include a local decision (Local Decision) and a global decision (Global Decision).
Further, the distinguishing unit of the present embodiment includes:
An input subunit, used for inputting each frame of data of the voice signal in frequency-domain form into the GMM-NOISE and the GMM-SPEECH respectively, to obtain the noise frame probability $p_{\text{noise},k}$ and the voice frame probability $p_{\text{speech},k}$ of each frame of data in each frequency band $k$.
This embodiment inputs each frame of the voice signal to be analyzed as a voice frame or a noise frame into the GMM-NOISE and the GMM-SPEECH respectively, and obtains from them the noise frame probability value and the voice frame probability value of that frame, so that whether the frame is a noise frame or a voice frame can be determined by comparing the magnitudes of the two probability values.
A first calculating subunit, used for calculating a local log-likelihood ratio according to $\Lambda_k = \log\left(p_{\text{speech},k} / p_{\text{noise},k}\right)$ for each frequency band $k$.
This embodiment preferably uses the VAD model based on the Gaussian mixture model GMM, which extracts the energy over 6 frequency bands of each input frame of the voice signal in frequency-domain form as the feature vector of that frame; therefore n takes the value 6 in this embodiment, and 6 local decisions are performed when each frame is judged, i.e., one local decision per frequency band. As long as the frame is judged to be a voice frame in any one of them, the frame is retained.
And the first judging subunit is used for judging whether the local log likelihood ratio is higher than a local threshold value.
This embodiment distinguishes voice frames from noise frames through the local decision, which is carried out once on each frequency band, 6 times in total. The likelihood ratio is an index reflecting authenticity and is a composite index reflecting both sensitivity and specificity, which improves the accuracy of probability estimation; by checking whether the local log-likelihood ratio is higher than the local threshold value in addition to the voice frame probability value being higher than the noise frame probability value, the accuracy of judging a frame to be a voice frame is further ensured.
And the first judging subunit is used for judging that the frame data with the local log likelihood ratio higher than the local threshold value is a voice frame if the local log likelihood ratio is higher than the local threshold value.
The GMM of the present embodiment has adaptive updating capability, and after each frame of speech signal is determined as a speech frame or a noise frame, the parameters of the corresponding model are updated according to the feature value of the frame. For example, if the frame is determined to be a SPEECH frame, the expected value, standard deviation and gaussian component weight value of the GMM-SPEECH are updated once according to the feature value of the frame, and after more SPEECH frames are input into the GMM-SPEECH, the GMM-SPEECH is more and more adapted to the voiceprint feature of the speaker of the SPEECH signal, and the analysis conclusion given is more accurate.
Further, a distinguishing unit according to another embodiment of the present application includes:
A second calculating subunit, used for calculating a global log-likelihood ratio according to $\Lambda_{\text{global}} = \sum_{k=1}^{n} w_k \log\left(p_{\text{speech},k} / p_{\text{noise},k}\right)$ if the local log-likelihood ratio is not higher than the local threshold value, where $w_k$ is the weight of frequency band $k$.
In this embodiment, the local decision is performed first and the global decision afterwards; the global decision computes a weighted sum over the frequency bands on the basis of the local decision results, so as to improve the accuracy of distinguishing voice frames from noise frames.
And the second judging subunit is used for judging whether the global log likelihood ratio is higher than a global threshold value.
In the global judgment of the embodiment, the global log likelihood ratio is compared with the global threshold value, so that the accuracy of screening the voice frames is further improved.
And the second judging subunit is used for judging that the frame data with the global log-likelihood ratio higher than the global threshold value is a voice frame if the global log-likelihood ratio is higher than the global threshold value.
In this embodiment, the global decision is skipped when the local decision already finds voice, which improves the efficiency of voiceprint verification while still recognizing as many voice frames as possible and avoiding voice distortion. Other embodiments of the present application may also perform the global decision even after the local decision finds voice, to further verify and confirm the existence of voice and improve the accuracy of distinguishing voice frames from noise frames.
Further, the extraction module 3 of the present embodiment includes:
And the extraction unit is used for extracting MFCC type voiceprint characteristics corresponding to each voice frame in the purified voice data.
The procedure for extracting MFCC (Mel Frequency Cepstrum Coefficient) type voiceprint features in this embodiment is as follows: the continuous analog voice signal of the purified voice data is sampled at a certain sampling period, converted into a discrete signal, and quantized into a digital signal according to a certain coding rule; pre-emphasis is then applied to compensate the high-frequency components, which are often suppressed due to the physiological characteristics of the human vocal tract; because of the short-time stationarity of the voice signal, the signal is then framed (generally 10 to 30 milliseconds per frame) for spectrum analysis and features are extracted frame by frame; windowing is then performed to reduce the discontinuity of the signal at the start and end of each frame, a Hamming window being adopted; the frame signal is then converted from the time domain to the frequency domain by a DFT, and the signal is mapped from the linear spectrum domain to the mel spectrum domain using the formula $\text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$; the converted frame signal is input into a group of mel triangular filter banks, and the logarithmic energy of the signal output by each band's filter is calculated to obtain a logarithmic energy sequence; finally, a discrete cosine transform (DCT) is performed on the logarithmic energy sequence to obtain the MFCC type voiceprint feature of the frame voice signal.
And the construction unit is used for constructing voiceprint feature vectors corresponding to the voice frames respectively according to the voiceprint features of the MFCC types.
The MFCC type voiceprint features have nonlinear features, so that analysis results on each frequency band are closer to the features of real voice sent by a human body, the voiceprint features are extracted more accurately, and the voiceprint verification effect is improved.
The mapping unit is used for mapping each voiceprint feature vector into a voiceprint identification vector I-vector with low dimensionality respectively so as to obtain first voiceprint features corresponding to each voice frame in the purified voice data.
This embodiment maps each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector based on the GMM-UBM (Gaussian Mixture Model-Universal Background Model), which reduces the calculation cost in the voiceprint feature extraction process and the use cost of voiceprint verification. The training process of the GMM-UBM of this embodiment is as follows: B1: acquire a preset number (e.g., 100,000) of voice data samples, each corresponding to a voiceprint recognition vector and collected from the voices of different people in different environments; such voice data samples are used to train a universal background model (GMM-UBM) capable of characterizing general voice characteristics. B2: process each voice data sample to extract the preset type of voiceprint features corresponding to it, and construct the voiceprint feature vector of each sample based on those features. B3: divide all constructed voiceprint feature vectors of the preset type into a training set with a first percentage and a verification set with a second percentage, wherein the first and second percentages sum to at most 100%. B4: train the GMM-UBM (hereinafter the second model) with the voiceprint feature vectors in the training set, and after training verify the accuracy of the trained second model with the verification set. B5: if the accuracy is greater than a preset accuracy (e.g., 98.5%), finish model training; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 on the enlarged sample set.
The voiceprint identification vector of this embodiment is expressed as an I-vector. The I-vector is a vector whose dimension is much lower than that of the Gaussian supervector space, which reduces the calculation cost; extracting the low-dimensional I-vector amounts to relating the low-dimensional vector $\omega_r$ to the higher-dimensional Gaussian space through a transformation matrix $T$ by the formula $m_r = \mu + T\,\omega_r$. The extraction of the I-vector comprises the following steps: after the training voice data from a certain target speaker is processed, the extracted preset type of voiceprint feature vectors (for example, MFCC) are input into the GMM-UBM, and a Gaussian supervector representing the probability distribution of the voice data over the Gaussian components is obtained; the lower-dimensional voiceprint identification vector I-vector corresponding to the section of voice can then be calculated from $m_r = \mu + T\,\omega_r$, wherein $m_r$ is the Gaussian supervector representing the section of voice, $\mu$ is the mean supervector of the second model, $T$ is the conversion matrix mapping the low-dimensional I-vector $\omega_r$ to the high-dimensional Gaussian space, and $T$ is trained with an EM algorithm.
Further, the judging module 4 of the present embodiment includes:
the voice print processing device comprises an acquisition unit, a target person processing unit and a target person processing unit, wherein the acquisition unit is used for acquiring respective corresponding pre-stored voice print characteristics from voice print characteristic data of a plurality of pre-stored persons respectively, and the voice print characteristic data of the plurality of persons comprise the pre-stored voice print characteristics of the target person.
This embodiment pre-stores the voiceprint feature data of multiple persons, including the target person, and judges against all of them whether the voiceprint feature of the currently collected voice signal is identical to that of the target person, which improves the judgment accuracy.
And the calculating unit is used for calculating the similarity value between each pre-stored voiceprint feature and the first voiceprint feature respectively.
The similarity value in this embodiment characterizes how similar a pre-stored voiceprint feature and the first voiceprint feature are; the greater the similarity value, the more similar the two. In this embodiment the similarity value is obtained from a feature distance value between the pre-stored voiceprint feature and the first voiceprint feature, where the feature distance value may be a cosine distance value, a Euclidean distance value, or the like.
And the sorting unit is used for sorting the similarity values in descending order.
This embodiment ranks the similarity values between each pre-stored voiceprint feature and the first voiceprint feature from large to small, so as to analyze more accurately how similar the first voiceprint feature is to each pre-stored voiceprint feature and thus verify the first voiceprint feature more accurately.
And the judging unit is used for judging whether the similarity value corresponding to the prestored voiceprint features of the target person is included in the preset number of similarity values which are ranked in front.
In this embodiment, when the top preset number of sorted similarity values include the similarity value corresponding to the pre-stored voiceprint feature of the target person, the first voiceprint feature is judged to be identical to the pre-stored voiceprint feature of the target person, which reduces the recognition error rate caused by model error; here the error rate covers both the frequency of verification failing when it should pass and the frequency of verification passing when it should not. The preset number in this embodiment may be 1, 2 or 3, and can be set according to the use requirement.
And the determining unit is used for determining that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition if the top similarity values include the one corresponding to the pre-stored voiceprint feature of the target person, and otherwise that the preset condition is not met.
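A minimal sketch of this sort-and-check decision (the person identifiers, similarity values, and preset number below are hypothetical):

```python
def verify_top_n(similarities: dict, target_id: str, preset_n: int = 3) -> bool:
    # Sort similarity values in descending order and check whether the target
    # person's pre-stored voiceprint falls within the top preset_n entries.
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return target_id in ranked[:preset_n]

# Similarity values between the first voiceprint feature and each pre-stored one
sims = {"target": 0.91, "person_a": 0.93, "person_b": 0.40, "person_c": 0.15}
print(verify_top_n(sims, "target", preset_n=2))  # True: target ranks second
```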
According to other embodiments of the application, effective voiceprint verification is realized by setting a distance threshold between the first voiceprint feature and the pre-stored voiceprint feature of the target user. For example, with a preset threshold of 0.6: if the cosine distance between the first voiceprint feature and the pre-stored voiceprint feature of the target user is less than or equal to the preset threshold, the first voiceprint feature is determined to be the same as the pre-stored voiceprint feature of the target user, and verification passes; if the cosine distance is greater than the preset threshold, the first voiceprint feature is determined to be different from the pre-stored voiceprint feature of the target user, and verification fails.
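A minimal sketch of this threshold variant, assuming the cosine distance form d(x, y) = 1 − (x·y)/(‖x‖ ‖y‖) and the example threshold of 0.6:

```python
import numpy as np

def verify_by_threshold(probe: np.ndarray, stored: np.ndarray, threshold: float = 0.6) -> bool:
    # Accept when the cosine distance to the target's pre-stored voiceprint
    # does not exceed the preset threshold.
    distance = 1.0 - float(probe @ stored) / (np.linalg.norm(probe) * np.linalg.norm(stored))
    return distance <= threshold

stored = np.array([0.2, 0.9, -0.4, 0.1])
probe = np.array([0.25, 0.85, -0.35, 0.05])
print(verify_by_threshold(probe, stored))  # True: distance is well under 0.6
```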
Further, the computing unit of the present embodiment includes:
A third calculation subunit, used for respectively calculating the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula d(x, y) = 1 − (x·y)/(‖x‖ ‖y‖), wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature.
This embodiment uses the cosine distance formula to characterize the similarity between each pre-stored voiceprint feature and the first voiceprint feature; the smaller the cosine distance value, the closer, or the more likely identical, the two voiceprint features are.
And the conversion subunit is used for converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
This embodiment may convert the cosine distance value into the similarity value by inverting it according to an inverse-proportion formula carrying a specified inverse-proportion coefficient, so that the smallest cosine distance yields the largest similarity value.
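The patent does not reproduce the inverse-proportion formula itself, so the following conversion is an assumed form with an illustrative coefficient:

```python
def distance_to_similarity(d: float, k: float = 1.0, eps: float = 1e-9) -> float:
    # Assumed inverse-proportion form: similarity = k / (d + eps); k is the
    # specified inverse-proportion coefficient and eps guards against d == 0.
    return k / (d + eps)

# The smallest cosine distance maps to the largest similarity value
for d in (0.05, 0.3, 0.9):
    print(d, distance_to_similarity(d))
```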
Referring to fig. 3, in an embodiment of the present application, there is further provided a computer device, which may be a server, and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required for the voiceprint verification process. The network interface of the computer device is used for communicating with an external terminal via a network connection. The computer program is executed by the processor to implement a method of voiceprint verification.
The method for executing the voiceprint verification by the processor comprises the following steps: inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing a voice frame and a noise frame in the voice signal; removing the noise frames to obtain purified voice data formed by the voice frames; extracting first voiceprint features corresponding to the purified voice data; judging whether the similarity of the first voiceprint features and the pre-stored voiceprint features meets a preset condition or not; if yes, judging that the first voiceprint features are the same as the pre-stored voiceprint features, otherwise, judging that the first voiceprint features are not the same.
According to the computer device, the noise data in the voice signal are identified and removed to obtain purified voice data, and voiceprint recognition is then carried out on the purified voice data, which improves the accuracy of voiceprint verification. Through the GMM-VAD model, noise data and voice data are accurately distinguished by combining local and global decisions, which improves the degree to which the voice signal is purified and further improves the accuracy of voiceprint verification. Based on the GMM-UBM, each voiceprint feature vector is mapped into a low-dimensional voiceprint identification vector I-vector, which reduces the calculation cost of voiceprint feature extraction and thus the cost of voiceprint verification. In the voiceprint verification process, comparing the first voiceprint feature against the pre-stored data of multiple persons reduces the error rate of voiceprint verification and the accuracy loss caused by model error.
In one embodiment, the VAD model includes a Fourier transform and two Gaussian mixture distributions, GMM-NOISE and GMM-SPEECH, and the step of the processor inputting the voice signal into the VAD model and distinguishing voice frames from noise frames in the voice signal includes: inputting the voice signal into the Fourier transform of the VAD model to convert it from a time-domain signal into a frequency-domain signal; and inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively for a VAD decision, so as to distinguish voice frames from noise frames in the voice signal.
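A minimal sketch of the time-domain to frequency-domain step, assuming 25 ms frames with a 10 ms hop at 16 kHz (these frame parameters are assumptions, not values from the patent):

```python
import numpy as np

def frames_to_spectra(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # Split the time-domain signal into overlapping windowed frames and convert
    # each frame into its magnitude spectrum (the frequency-domain form).
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))
    return spectra

# One second of a toy 16 kHz signal
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(frames_to_spectra(sig).shape)  # (98, 201): 98 frames, 201 frequency bins
```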
In one embodiment, the step of the processor inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH for a VAD decision, so as to distinguish voice frames from noise frames in the voice signal, includes: inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH to obtain, for each frame of data, the noise frame probability p(x|noise) and the voice frame probability p(x|speech); calculating a local log-likelihood ratio according to Λ_local = log( p(x|speech) / p(x|noise) ); judging whether the local log-likelihood ratio is higher than a local threshold; and if so, judging the frame data whose local log-likelihood ratio is higher than the local threshold to be voice frames.
In one embodiment, after the step of determining whether the local log-likelihood ratio is higher than a local threshold, the method executed by the processor includes: if the local log-likelihood ratio is not higher than the local threshold, calculating a global log-likelihood ratio over the voice signal as a whole; judging whether the global log-likelihood ratio is higher than a global threshold; and if the global log-likelihood ratio is higher than the global threshold, judging the corresponding frame data to be voice frames.
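A sketch of the two-stage local/global decision; the GMM parameters, the running 10-frame statistic standing in for the patent's global formula, and both thresholds are assumptions:

```python
import numpy as np

def log_gauss(x, mean, var):
    # Log density of a diagonal-covariance Gaussian component
    return -0.5 * float(np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def gmm_loglik(x, weights, means, variances):
    # Log-likelihood of one frame's features under a small diagonal GMM
    comps = [np.log(w) + log_gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
    return float(np.logaddexp.reduce(comps))

class TwoStageVAD:
    def __init__(self, speech_gmm, noise_gmm, local_thr=1.0, global_thr=0.5):
        self.speech_gmm, self.noise_gmm = speech_gmm, noise_gmm
        self.local_thr, self.global_thr = local_thr, global_thr
        self.recent = []  # recent local log-likelihood ratios for the global test

    def is_voice(self, frame):
        llr = gmm_loglik(frame, *self.speech_gmm) - gmm_loglik(frame, *self.noise_gmm)
        self.recent = (self.recent + [llr])[-10:]
        if llr > self.local_thr:                    # local test: the frame alone decides
            return True
        return sum(self.recent) > self.global_thr   # global test over recent frames

# Toy two-component GMMs over 2-D frame features: (weights, means, variances)
speech = ([0.5, 0.5], [np.array([3.0, 3.0]), np.array([4.0, 4.0])], [np.ones(2), np.ones(2)])
noise = ([0.5, 0.5], [np.zeros(2), np.array([0.5, 0.5])], [np.ones(2), np.ones(2)])
print(TwoStageVAD(speech, noise).is_voice(np.array([3.2, 2.9])))  # True: speech-like frame
```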
In one embodiment, the step of extracting, by the processor, a first voiceprint feature corresponding to the purified voice data includes: extracting MFCC type voiceprint features corresponding to each voice frame in the purified voice data; constructing voiceprint feature vectors corresponding to the voice frames respectively according to the voiceprint features of the MFCC types; and mapping each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector respectively to obtain first voiceprint features corresponding to each voice frame in the purified voice data.
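A sketch of this extraction step, using librosa for the MFCC features; the number of coefficients (20) and the linear projection standing in for the trained I-vector extractor are assumptions:

```python
import numpy as np
import librosa

def extract_first_voiceprint(cleaned: np.ndarray, sr: int, proj: np.ndarray) -> np.ndarray:
    # MFCC-type voiceprint features, one column per frame of the purified voice data
    mfcc = librosa.feature.mfcc(y=cleaned, sr=sr, n_mfcc=20)   # shape (20, n_frames)
    feats = mfcc.T                                             # one feature vector per voice frame
    # Map each voiceprint feature vector to a low-dimensional identification
    # vector; `proj` is a placeholder for the trained I-vector mapping.
    return feats @ proj                                        # shape (n_frames, K)

# Usage with a toy signal and a random placeholder projection to K = 8 dimensions
sr = 16000
cleaned = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
proj = np.random.default_rng(0).normal(size=(20, 8))
print(extract_first_voiceprint(cleaned, sr, proj).shape)
```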
In one embodiment, the step of determining, by the processor, whether the similarity between the first voiceprint feature and the pre-stored voiceprint features meets a preset condition includes: respectively acquiring the corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of a plurality of persons, wherein the voiceprint feature data of the plurality of persons include the pre-stored voiceprint features of the target person; respectively calculating similarity values between each pre-stored voiceprint feature and the first voiceprint feature; sorting the similarity values in descending order; judging whether the top preset number of sorted similarity values include the similarity value corresponding to the pre-stored voiceprint feature of the target person; and if so, judging that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, otherwise that it does not.
In one embodiment, the step of calculating, by the processor, a similarity value between each pre-stored voiceprint feature and the first voiceprint feature includes: respectively calculating the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula d(x, y) = 1 − (x·y)/(‖x‖ ‖y‖), wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature; and converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of voiceprint verification comprising: inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing voice frames and noise frames in the voice signal; removing the noise frames to obtain purified voice data formed by the voice frames; extracting first voiceprint features corresponding to the purified voice data; judging whether the similarity of the first voiceprint features and the pre-stored voiceprint features meets a preset condition; and if so, judging that the first voiceprint features are the same as the pre-stored voiceprint features, otherwise that they are not.
The computer-readable storage medium identifies and removes the noise data in the voice signal to obtain purified voice data, and voiceprint recognition is then carried out on the purified voice data, which improves the accuracy of voiceprint verification. Through the GMM-VAD model, noise data and voice data are accurately distinguished by combining local and global decisions, which improves the degree to which the voice signal is purified and further improves the accuracy of voiceprint verification. Based on the GMM-UBM, each voiceprint feature vector is mapped into a low-dimensional voiceprint identification vector I-vector, which reduces the calculation cost of voiceprint feature extraction and thus the cost of voiceprint verification. In the voiceprint verification process, comparing the first voiceprint feature against the pre-stored data of multiple persons reduces the error rate of voiceprint verification and the accuracy loss caused by model error.
In one embodiment, the VAD model includes a Fourier transform and two Gaussian mixture distributions, GMM-NOISE and GMM-SPEECH, and the step of the processor inputting the voice signal into the VAD model and distinguishing voice frames from noise frames in the voice signal includes: inputting the voice signal into the Fourier transform of the VAD model to convert it from a time-domain signal into a frequency-domain signal; and inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH respectively for a VAD decision, so as to distinguish voice frames from noise frames in the voice signal.
In one embodiment, the step of the processor inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH for a VAD decision, so as to distinguish voice frames from noise frames in the voice signal, includes: inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH to obtain, for each frame of data, the noise frame probability p(x|noise) and the voice frame probability p(x|speech); calculating a local log-likelihood ratio according to Λ_local = log( p(x|speech) / p(x|noise) ); judging whether the local log-likelihood ratio is higher than a local threshold; and if so, judging the frame data whose local log-likelihood ratio is higher than the local threshold to be voice frames.
In one embodiment, after the step of determining whether the local log-likelihood ratio is higher than a local threshold, the method executed by the processor includes: if the local log-likelihood ratio is not higher than the local threshold, calculating a global log-likelihood ratio over the voice signal as a whole; judging whether the global log-likelihood ratio is higher than a global threshold; and if the global log-likelihood ratio is higher than the global threshold, judging the corresponding frame data to be voice frames.
In one embodiment, the step of extracting, by the processor, a first voiceprint feature corresponding to the purified voice data includes: extracting MFCC type voiceprint features corresponding to each voice frame in the purified voice data; constructing voiceprint feature vectors corresponding to the voice frames respectively according to the voiceprint features of the MFCC types; and mapping each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector respectively to obtain first voiceprint features corresponding to each voice frame in the purified voice data.
In one embodiment, the step of determining, by the processor, whether the similarity between the first voiceprint feature and the pre-stored voiceprint features meets a preset condition includes: respectively acquiring the corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of a plurality of persons, wherein the voiceprint feature data of the plurality of persons include the pre-stored voiceprint features of the target person; respectively calculating similarity values between each pre-stored voiceprint feature and the first voiceprint feature; sorting the similarity values in descending order; judging whether the top preset number of sorted similarity values include the similarity value corresponding to the pre-stored voiceprint feature of the target person; and if so, judging that the similarity between the first voiceprint feature and the pre-stored voiceprint feature meets the preset condition, otherwise that it does not.
In one embodiment, the step of calculating, by the processor, a similarity value between each pre-stored voiceprint feature and the first voiceprint feature includes: respectively calculating the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula d(x, y) = 1 − (x·y)/(‖x‖ ‖y‖), wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature; and converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.

Claims (8)

1. A method of voiceprint verification, comprising:
Inputting a voice signal to be voiceprint verified into a VAD model, and distinguishing a voice frame and a noise frame in the voice signal;
removing the noise frames to obtain purified voice data formed by the voice frames;
extracting first voiceprint features corresponding to the purified voice data;
Judging whether the similarity of the first voiceprint features and the pre-stored voiceprint features meets a preset condition or not;
If yes, judging that the first voiceprint features are the same as the pre-stored voiceprint features, otherwise, judging that the first voiceprint features are not the same;
The VAD model judges whether each inputted frame of the voice signal is a voice frame or a noise frame by scoring each frame; when the probability value of a frame being a voice frame is larger than a preset judgment threshold, the frame is judged to be a voice frame, and otherwise it is judged to be a noise frame;
The VAD model comprises a Fourier transform and Gaussian mixture distributions GMM-NOISE and GMM-SPEECH, and the step of inputting the voice signal into the VAD model to distinguish voice frames and noise frames in the voice signal comprises the following steps:
Inputting the voice signal into Fourier transform in VAD model, converting the voice signal from time domain signal form to frequency domain signal form;
Inputting each frame data of the voice signal in the frequency domain signal form into the GMM-NOISE and the GMM-SPEECH respectively for VAD judgment so as to distinguish voice frames and NOISE frames in the voice signal;
The step of inputting each frame data of the voice signal in the frequency domain signal form into the GMM-NOISE and GMM-SPEECH to make VAD decision to distinguish the voice frame from the NOISE frame in the voice signal, includes:
Inputting each frame of data of the voice signal in frequency-domain form into GMM-NOISE and GMM-SPEECH to obtain, for each frame of data, the noise frame probability p(x|noise) and the voice frame probability p(x|speech);
Calculating a local log-likelihood ratio according to Λ_local = log( p(x|speech) / p(x|noise) );
Judging whether the local log likelihood ratio is higher than a local threshold value or not;
If yes, judging that the frame data with the local log likelihood ratio higher than a local threshold value is a voice frame.
2. The method of voiceprint verification according to claim 1, wherein after the step of determining whether the local log likelihood ratio is higher than a local threshold, the method comprises:
If the local log likelihood ratio is not higher than the local threshold value, calculating a global log-likelihood ratio over the voice signal as a whole;
judging whether the global log likelihood ratio is higher than a global threshold value or not;
if the global log-likelihood ratio is higher than the global threshold value, judging that the frame data with the global log-likelihood ratio higher than the global threshold value is a voice frame.
3. The method of voiceprint verification according to claim 1, wherein the step of extracting a first voiceprint feature corresponding to the cleaned speech data comprises:
Extracting MFCC type voiceprint features corresponding to each voice frame in the purified voice data;
Constructing voiceprint feature vectors corresponding to the voice frames respectively according to the voiceprint features of the MFCC types;
and mapping each voiceprint feature vector into a low-dimensional voiceprint identification vector I-vector respectively to obtain first voiceprint features corresponding to each voice frame in the purified voice data.
4. The method of voiceprint verification according to claim 3, wherein the step of determining whether the similarity of the first voiceprint feature to a pre-stored voiceprint feature meets a preset condition comprises:
Respectively acquiring corresponding pre-stored voiceprint features from the pre-stored voiceprint feature data of a plurality of persons, wherein the voiceprint feature data of the plurality of persons comprises pre-stored voiceprint features of a target person;
respectively calculating similarity values between each pre-stored voiceprint feature and the first voiceprint feature;
Sorting the similarity values in descending order;
Judging whether the top preset number of sorted similarity values include the similarity value corresponding to the pre-stored voiceprint feature of the target person;
if yes, judging that the similarity of the first voiceprint features and the pre-stored voiceprint features meets the preset condition, otherwise, not meeting the preset condition.
5. The method of voiceprint verification according to claim 4, wherein the step of separately calculating similarity values between each of the pre-stored voiceprint features and the first voiceprint feature comprises:
Respectively calculating the cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature through the cosine distance formula d(x, y) = 1 − (x·y)/(‖x‖ ‖y‖), wherein x represents each pre-stored voiceprint identification vector and y represents the voiceprint identification vector of the first voiceprint feature;
And converting the cosine distance value into the similarity value, wherein the smallest cosine distance value corresponds to the largest similarity value.
6. A voiceprint verification apparatus for implementing the method of any one of claims 1 to 5, comprising:
The distinguishing module is used for inputting the voice signal to be voiceprint verified into the VAD model and distinguishing the voice frame and the noise frame in the voice signal;
the removing module is used for removing the noise frames to obtain purified voice data formed by the voice frames;
the extraction module is used for extracting first voiceprint features corresponding to the purified voice data;
The judging module is used for judging whether the similarity between the first voiceprint characteristics and the pre-stored voiceprint characteristics meets a preset condition or not;
And the determining module is used for determining that the first voiceprint features are the same as the pre-stored voiceprint features if the preset condition is met, and otherwise that they are not.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN201811184693.9A 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium Active CN109378002B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811184693.9A CN109378002B (en) 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium
PCT/CN2018/124401 WO2020073518A1 (en) 2018-10-11 2018-12-27 Voiceprint verification method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811184693.9A CN109378002B (en) 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109378002A CN109378002A (en) 2019-02-22
CN109378002B true CN109378002B (en) 2024-05-07

Family

ID=65403684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811184693.9A Active CN109378002B (en) 2018-10-11 2018-10-11 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109378002B (en)
WO (1) WO2020073518A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 It can interactive intelligence voice home control device and control method based on open source hardware
CN110675878A (en) * 2019-09-23 2020-01-10 金瓜子科技发展(北京)有限公司 Method and device for identifying vehicle and merchant, storage medium and electronic equipment
CN110838296B (en) * 2019-11-18 2022-04-29 锐迪科微电子科技(上海)有限公司 Recording process control method, system, electronic device and storage medium
WO2021127990A1 (en) * 2019-12-24 2021-07-01 广州国音智能科技有限公司 Voiceprint recognition method based on voice noise reduction and related apparatus
CN111274434A (en) * 2020-01-16 2020-06-12 上海携程国际旅行社有限公司 Audio corpus automatic labeling method, system, medium and electronic equipment
CN111524524B (en) * 2020-04-28 2021-10-22 平安科技(深圳)有限公司 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN114333767A (en) * 2020-09-29 2022-04-12 华为技术有限公司 Speaker voice extraction method, device, storage medium and electronic equipment
CN112331217B (en) * 2020-11-02 2023-09-12 泰康保险集团股份有限公司 Voiceprint recognition method and device, storage medium and electronic equipment
CN112735433A (en) * 2020-12-29 2021-04-30 平安普惠企业管理有限公司 Identity verification method, device, equipment and storage medium
CN113488059A (en) * 2021-08-13 2021-10-08 广州市迪声音响有限公司 Voiceprint recognition method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN105575406A (en) * 2016-01-07 2016-05-11 深圳市音加密科技有限公司 Noise robustness detection method based on likelihood ratio test
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition
CN108109612A (en) * 2017-12-07 2018-06-01 苏州大学 A kind of speech recognition sorting technique based on self-adaptive reduced-dimensions
CN108154371A (en) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic device, the method for authentication and storage medium
CN108172230A (en) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method, terminal installation and storage medium based on Application on Voiceprint Recognition model
CN108428456A (en) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 Voice de-noising algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6349278B1 (en) * 1999-08-04 2002-02-19 Ericsson Inc. Soft decision signal estimation
CN103236260B (en) * 2013-03-29 2015-08-12 京东方科技集团股份有限公司 Speech recognition system


Also Published As

Publication number Publication date
CN109378002A (en) 2019-02-22
WO2020073518A1 (en) 2020-04-16


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant