WO2018166112A1 - Voiceprint recognition-based identity verification method, electronic device, and storage medium - Google Patents

Voiceprint recognition-based identity verification method, electronic device, and storage medium Download PDF

Info

Publication number
WO2018166112A1
WO2018166112A1 (PCT/CN2017/091361)
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
voice data
vector
feature vector
gaussian mixture
Prior art date
Application number
PCT/CN2017/091361
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
丁涵宇
郭卉
肖京
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2018166112A1 publication Critical patent/WO2018166112A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/08: Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0861: Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a method, an electronic device, and a storage medium for identity verification based on voiceprint recognition.
  • a first aspect of the present invention provides a voiceprint recognition based authentication method, and the voiceprint recognition based identity verification method includes:
  • a second aspect of the present invention provides an electronic device, including a processing device, a storage device, and a voiceprint recognition-based identity verification system, wherein the voiceprint recognition-based identity verification system is stored in the storage device and comprises at least one computer readable instruction, the at least one computer readable instruction being executable by the processing device to:
  • a third aspect of the invention provides a computer readable storage medium having stored thereon at least one computer readable instruction executable by a processing device to:
  • the invention has the following beneficial effects: the background channel model generated by pre-training is obtained by mining and comparing a large amount of voice data, and the model can accurately depict the background voiceprint features of the user's voice while retaining the user's own voiceprint features to the utmost extent.
  • these background voiceprint features can be removed at recognition time so that the intrinsic features of the user's voice are extracted, which can greatly improve the accuracy of user identity verification and improve the efficiency of identity verification.
  • FIG. 1 is a schematic diagram of an application environment of a preferred embodiment of a voiceprint recognition based authentication method according to the present invention
  • FIG. 2 is a schematic flow chart of a preferred embodiment of a voiceprint recognition based identity verification method according to the present invention
  • FIG. 3 is a schematic diagram showing the refinement process of step S1 shown in FIG. 2;
  • FIG. 4 is a schematic diagram showing the refinement process of step S3 shown in FIG. 2;
  • FIG. 5 is a schematic structural diagram of a system for authenticating a voiceprint recognition based authentication method according to the present invention.
  • FIG. 1 is a schematic diagram of the application environment of a preferred embodiment of the voiceprint recognition-based identity verification method according to the present invention.
  • the application environment diagram includes an electronic device 1 and a terminal device 2.
  • the electronic device 1 can perform data interaction with the terminal device 2 through a suitable technology such as a network or a near field communication technology.
  • the terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), or a smart wearable device.
  • the electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance.
  • the electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • the electronic device 1 includes, but is not limited to, a storage device 11, a processing device 12, and a network interface 13 that are communicably connected to each other through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 having the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • the storage device 11 includes a memory and at least one type of readable storage medium.
  • the memory provides a cache for the operation of the electronic device 1;
  • the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like.
  • the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be a storage device external to the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • the readable storage medium of the storage device 11 is generally used to store the operating system installed on the electronic device 1 and various types of application software, such as the program code of the voiceprint recognition-based identity verification system 10 in an embodiment of the present application. Further, the storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • Processing device 12 may, in some embodiments, include one or more microprocessors, microcontrollers, digital processors, and the like.
  • the processing device 12 is generally used to control the operation of the electronic device 1, for example, to perform control and processing related to data interaction or communication with the terminal device 2.
  • the processing device 12 is operative to run program code or process data stored in the storage device 11, such as a system 10 that runs voiceprint recognition based authentication.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and establish a data transmission channel and a communication connection between the electronic device 1 and one or more terminal devices 2.
  • the voiceprint recognition-based identity verification system 10 includes at least one computer readable instruction stored in the storage device 11, the at least one computer readable instruction being executable by the processing device 12 to implement the voiceprint recognition-based identity verification method of the embodiments of the present application. As described later, the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.
  • when the voiceprint recognition-based identity verification system 10 is executed by the processing device 12, the following operations are performed: first, after receiving the voice data of the user undergoing identity verification, the voiceprint feature of the voice data is acquired, and a corresponding voiceprint feature vector is constructed based on the voiceprint feature; then, the voiceprint feature vector is input into a background channel model generated by pre-training to construct a current voiceprint discrimination vector corresponding to the voice data; finally, the spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the user is calculated, the user is authenticated based on the distance, and a verification result is generated.
  • FIG. 2 is a schematic flowchart of a preferred embodiment of the voiceprint recognition-based identity verification method according to the present invention.
  • the voiceprint recognition-based identity verification method of this embodiment is not limited to the steps shown in the flowchart; in addition, among the steps shown in the flowchart, some steps may be omitted, and the order between the steps may be changed.
  • the method for voiceprint recognition based authentication includes the following steps:
  • Step S1 after receiving the voice data of the user who performs the authentication, acquiring the voiceprint feature of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint feature;
  • in this embodiment, the voice data is collected by a voice collection device (for example, a microphone), and the voice collection device sends the collected voice data to the voiceprint recognition-based identity verification system.
  • when collecting voice data, environmental noise and interference from the voice collection equipment should be minimized as far as possible. The voice collection device should be kept at an appropriate distance from the user, and large voice collection equipment should be avoided where possible. The power supply should preferably be mains power, keeping the current stable; a sensor should be used when recording over the telephone.
  • the voice data may be denoised prior to extracting the voiceprint features in the voice data to further reduce interference.
  • the collected voice data is voice data of a preset data length, or voice data greater than a preset data length.
  • voiceprint features include various types, such as wide-band voiceprints, narrow-band voiceprints, and amplitude voiceprints; the voiceprint feature of this embodiment is preferably the Mel Frequency Cepstrum Coefficients (MFCC) of the voice data.
  • the voiceprint feature of the voice data is composed into a feature data matrix, which is a voiceprint feature vector of the voice data.
  • Step S2 input the voiceprint feature vector into a background channel model generated by pre-training to construct a current voiceprint discrimination vector corresponding to the voice data;
  • the voiceprint feature vector is input into the background channel model generated by the pre-training.
  • the background channel model is a Gaussian mixture model, and the background channel model is used to calculate the voiceprint feature vector to obtain a corresponding current voiceprint discrimination vector (i-vector).
  • the calculation process involves computing a likelihood logarithmic matrix Loglike from the voiceprint feature vector, where E(X) is the mean matrix trained by the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 denotes the element-wise square of each value of the matrix.
  • to extract the current voiceprint discrimination vector, the first-order and second-order coefficients are calculated first. The first-order coefficients can be obtained by summing the probability matrix: gamma_i = Σ_j loglikes_ji, where gamma_i is the i-th element of the first-order coefficient vector and loglikes_ji is the element in the j-th row and i-th column of the probability matrix. The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix: X = loglike^T * feats, where X is the second-order coefficient matrix, loglike is the probability matrix, and feats is the feature data matrix.
  • in this embodiment, the first-order and second-order coefficients are calculated in parallel, and the current voiceprint discrimination vector is then calculated from the first-order and second-order coefficients.
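The first-order and second-order coefficient computations described above can be sketched in a few lines of NumPy (a minimal illustrative sketch; the function and variable names are not from the patent):

```python
import numpy as np

def first_second_order_stats(loglikes: np.ndarray, feats: np.ndarray):
    """Per-component statistics as described in the text.

    loglikes : (T, K) probability matrix, one row per frame,
               one column per Gaussian component.
    feats    : (T, D) feature data matrix, one MFCC row per frame.
    """
    # First-order coefficients: gamma_i = sum_j loglikes[j, i]
    gamma = loglikes.sum(axis=0)   # shape (K,)
    # Second-order coefficients: X = loglike^T * feats
    X = loglikes.T @ feats         # shape (K, D)
    return gamma, X
```

Both statistics depend only on independent reductions over the frame axis, which is why the text notes they can be computed in parallel.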
  • the background channel model is a Gaussian mixture model
  • before step S2, the method further includes:
  • the voiceprint feature vector corresponding to each voice data sample is divided into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
  • the Gaussian mixture model is trained by using the voiceprint feature vector in the training set, and after the training is completed, the accuracy of the trained Gaussian mixture model is verified by using the verification set;
  • if the accuracy is greater than a preset threshold, the model training ends, and the trained Gaussian mixture model is used as the background channel model of step S2; if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and the model is re-trained based on the increased voice data samples.
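The train/verification split described above can be sketched as follows (a hypothetical helper; the function name, seed, and ratio values are illustrative assumptions):

```python
import numpy as np

def split_samples(vectors, first_ratio=0.7, second_ratio=0.3, seed=0):
    """Split per-sample voiceprint feature vectors into a training set of
    `first_ratio` and a verification set of `second_ratio`, where the sum
    of the two ratios must be <= 1, as the text requires."""
    assert first_ratio + second_ratio <= 1.0
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(vectors))       # shuffle before splitting
    n_train = int(len(vectors) * first_ratio)
    n_valid = int(len(vectors) * second_ratio)
    train = [vectors[i] for i in idx[:n_train]]
    valid = [vectors[i] for i in idx[n_train:n_train + n_valid]]
    return train, valid
```

The verification set is then used only to measure accuracy after training; if accuracy falls at or below the threshold, more samples are gathered and training is repeated.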
  • in this embodiment, the likelihood probability corresponding to an extracted D-dimensional voiceprint feature can be expressed by K Gaussian components: P(x) = Σ_{k=1..K} w_k p(x|k), where P(x) is the probability that a voice data sample is generated by the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability of x under the k-th Gaussian component, and K is the number of Gaussian components.
  • the parameters of the entire Gaussian mixture model can be expressed as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i is the mean of the i-th Gaussian component, and Σ_i is the covariance matrix of the i-th Gaussian component.
  • the Gaussian mixture model can be trained with the unsupervised EM (expectation-maximization) algorithm. After training is completed, the weight vector, constant vector, N covariance matrices, and the means multiplied by the covariance matrices of the Gaussian mixture model are obtained; together these constitute a trained Gaussian mixture model.
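The mixture density P(x) = Σ_k w_k p(x|k) can be evaluated directly in NumPy, assuming diagonal covariances for simplicity (an illustrative sketch, not the patent's implementation):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density N(x; mean, diag(var))."""
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def mixture_density(x, weights, means, variances):
    """P(x) = sum_k w_k * p(x | k) for a K-component diagonal GMM."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

Fitting {w_i, μ_i, Σ_i} with EM alternates between computing per-component posteriors for each sample (E-step) and re-estimating weights, means, and covariances from those posteriors (M-step).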
  • Step S3 Calculate a spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the user, perform identity verification on the user based on the distance, and generate a verification result.
  • the spatial distance of this embodiment is a cosine distance, which uses the cosine of the angle between two vectors in the vector space as a measure of the magnitude of the difference between two individuals.
  • the standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance, and the standard voiceprint discriminant vector carries the identifier information of the corresponding user when stored, which can accurately represent the identity of the corresponding user.
  • the stored voiceprint discrimination vector is obtained according to the identification information provided by the user before calculating the spatial distance.
  • if the spatial distance is within a preset distance threshold, the verification passes; otherwise, the verification fails.
  • the background channel model generated by pre-training in this embodiment is obtained by mining and comparing a large amount of voice data. While maximally retaining the user's own voiceprint features, the model can accurately depict the background voiceprint features present when the user speaks, remove these background features at recognition time, and extract the inherent features of the user's voice, which can greatly improve the accuracy of user identity verification and improve its efficiency. The method also makes full use of the voiceprint features related to the vocal organs in the human voice.
  • this voiceprint feature does not require restricting the spoken text, so it offers greater flexibility in the process of recognition and verification.
  • the foregoing step S1 includes:
  • Step S11 Perform pre-emphasis, framing, and windowing on the voice data.
  • first, the voice data is pre-emphasized and divided into frames; after framing, each frame signal is regarded as a stationary signal. Because the start and end of each frame of the framed speech are discontinuous, framing introduces a deviation from the original speech; therefore, the voice data also needs to be windowed.
  • Step S12: perform a Fourier transform on each windowed frame to obtain the corresponding spectrum;
  • Step S13: input the spectrum into a Mel filter bank to output a Mel spectrum;
  • Step S14: perform cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and compose a corresponding voiceprint feature vector based on the MFCC.
  • the cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform. The inverse transform is generally implemented by the discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
  • the Mel frequency cepstrum coefficient MFCC is the voiceprint feature of the speech data of this frame, and the Mel frequency cepstral coefficient MFCC of each frame is composed into a feature data matrix, which is the voiceprint feature vector of the speech data.
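Steps S11 through S14 can be sketched end-to-end in NumPy/SciPy. This is a simplified illustration: the sampling rate, frame length, FFT size, and filter count are assumed values, and the triangular Mel filter bank below is a common textbook construction rather than the patent's own:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, n_fft=512, n_filters=26):
    # S11a: pre-emphasis, a high-pass filter that boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # S11b: framing with 1/2-frame overlap, then Hamming windowing.
    step = frame_len // 2
    n_frames = max(1, 1 + (len(emphasized) - frame_len) // step)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * step:i * step + frame_len] * window
                       for i in range(n_frames)])
    # S12: Fourier transform of each window -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # S13: Mel filter bank -> Mel spectrum.
    fb = mel_filterbank(n_filters, n_fft, sr)
    mel_energies = np.maximum(power @ fb.T, 1e-10)
    # S14: cepstral analysis: log, then DCT.
    cepstra = dct(np.log(mel_energies), type=2, axis=1, norm='ortho')
    # Keep the 2nd-13th coefficients, as described in the text.
    return cepstra[:, 1:13]
```

Stacking the per-frame MFCC rows yields the feature data matrix that serves as the voiceprint feature vector.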
  • Step S3 includes:
  • Step S31: calculate the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: cos θ = (A · B) / (‖A‖ ‖B‖), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector;
  • Step S32: if the cosine distance is less than or equal to a preset distance threshold, generate verification-pass information;
  • Step S33: if the cosine distance is greater than the preset distance threshold, generate verification-failure information.
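Steps S31 through S33 amount to a threshold test on the cosine distance between the two i-vectors (a minimal sketch; the threshold value is an assumed placeholder, not from the patent):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(theta) between two i-vectors; smaller means more similar."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

def verify(current_vec, standard_vec, threshold=0.5):
    """Pass verification when the cosine distance is within the threshold."""
    return cosine_distance(current_vec, standard_vec) <= threshold
```

Because the comparison is a single vector operation, the decision cost is negligible next to feature extraction and model scoring.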
  • in another embodiment, the foregoing step S3 is replaced by: calculating the spatial distance between the current voiceprint discrimination vector and each of the pre-stored standard voiceprint discrimination vectors, obtaining the smallest spatial distance, authenticating the user based on the minimum spatial distance, and generating a verification result.
  • the difference between this embodiment and the foregoing embodiment is that the standard voiceprint discrimination vectors are stored without carrying the user's identification information. When verifying the user's identity, the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector is calculated and the minimum spatial distance is obtained; if the minimum spatial distance is less than a preset distance threshold (which may be the same as or different from the distance threshold of the foregoing embodiment), the verification passes; otherwise, the verification fails.
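The 1:N variant above, which matches against every stored standard vector and keeps the nearest one, can be sketched as follows (illustrative; the function name and threshold are assumptions):

```python
import numpy as np

def identify(current_vec, stored_vecs, threshold=0.5):
    """Compare the current i-vector against every stored standard i-vector;
    verification passes if the smallest cosine distance is below the
    threshold. Returns (passed, index_of_best_match, min_distance)."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    distances = [cos_dist(current_vec, v) for v in stored_vecs]
    best = int(np.argmin(distances))
    return distances[best] < threshold, best, distances[best]
```

Since no identification information accompanies the stored vectors, the index of the nearest vector doubles as the implied identity of the speaker when verification passes.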
  • FIG. 5 is a functional block diagram of a preferred embodiment of the voiceprint recognition based authentication system 10 of the present invention.
  • the voiceprint recognition-based identity verification system 10 can be partitioned into one or more modules, the one or more modules being stored in a memory and executed by one or more processors to implement the present invention.
  • for example, the voiceprint recognition-based identity verification system 10 can be divided into a first obtaining module 101, a constructing module 102, and a first verification module 103.
  • module refers to a series of computer program instruction segments capable of performing a specific function, which is more suitable than the program for describing the execution of the voiceprint recognition based authentication system 10 in an electronic device, wherein:
  • the first obtaining module 101 is configured to acquire a voiceprint feature of the voice data after receiving the voice data of the user who performs the identity verification, and construct a corresponding voiceprint feature vector based on the voiceprint feature;
  • in this embodiment, the voice data is collected by a voice collection device (for example, a microphone), and the voice collection device sends the collected voice data to the voiceprint recognition-based identity verification system.
  • when collecting voice data, environmental noise and interference from the voice collection equipment should be minimized as far as possible. The voice collection device should be kept at an appropriate distance from the user, and large voice collection equipment should be avoided where possible. The power supply should preferably be mains power, keeping the current stable; a sensor should be used when recording over the telephone.
  • the voice data may be denoised prior to extracting the voiceprint features in the voice data to further reduce interference.
  • the collected voice data is voice data of a preset data length, or voice data greater than the preset data length.
  • voiceprint features include various types, such as wide-band voiceprints, narrow-band voiceprints, and amplitude voiceprints; the voiceprint feature of this embodiment is preferably the Mel Frequency Cepstrum Coefficients (MFCC) of the voice data.
  • the voiceprint feature of the voice data is composed into a feature data matrix, which is a voiceprint feature vector of the voice data.
  • the constructing module 102 is configured to input the voiceprint feature vector into a background channel model generated by pre-training to construct a current voiceprint discrimination vector corresponding to the voice data;
  • the voiceprint feature vector is input into the background channel model generated by the pre-training.
  • the background channel model is a Gaussian mixture model, and the background channel model is used to calculate the voiceprint feature vector to obtain a corresponding current voiceprint discrimination vector (i-vector).
  • the calculation process involves computing a likelihood logarithmic matrix Loglike from the voiceprint feature vector, where E(X) is the mean matrix trained by the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 denotes the element-wise square of each value of the matrix.
  • to extract the current voiceprint discrimination vector, the first-order and second-order coefficients are calculated first. The first-order coefficients can be obtained by summing the probability matrix: gamma_i = Σ_j loglikes_ji, where gamma_i is the i-th element of the first-order coefficient vector and loglikes_ji is the element in the j-th row and i-th column of the probability matrix. The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix: X = loglike^T * feats, where X is the second-order coefficient matrix, loglike is the probability matrix, and feats is the feature data matrix.
  • in this embodiment, the first-order and second-order coefficients are calculated in parallel, and the current voiceprint discrimination vector is then calculated from the first-order and second-order coefficients.
  • the background channel model is a Gaussian mixture model
  • the voiceprint recognition based authentication system further comprises:
  • a second acquiring module is configured to acquire a preset number of voice data samples, obtain the voiceprint feature corresponding to each voice data sample, and construct a voiceprint feature vector corresponding to each voice data sample based on the voiceprint feature corresponding to each voice data sample;
  • a dividing module configured to divide the voiceprint feature vector corresponding to each voice data sample into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
  • a training module is configured to train the Gaussian mixture model by using the voiceprint feature vector in the training set, and after the training is completed, verify the accuracy of the trained Gaussian mixture model by using the verification set;
  • a processing module: if the accuracy is greater than a preset threshold, the model training ends and the trained Gaussian mixture model is used as the background channel model; if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and the model is re-trained based on the increased voice data samples.
  • in this embodiment, the likelihood probability corresponding to an extracted D-dimensional voiceprint feature can be expressed by K Gaussian components: P(x) = Σ_{k=1..K} w_k p(x|k), where P(x) is the probability that a voice data sample is generated by the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability of x under the k-th Gaussian component, and K is the number of Gaussian components.
  • the parameters of the entire Gaussian mixture model can be expressed as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i is the mean of the i-th Gaussian component, and Σ_i is the covariance matrix of the i-th Gaussian component.
  • the Gaussian mixture model can be trained with the unsupervised EM (expectation-maximization) algorithm. After training is completed, the weight vector, constant vector, N covariance matrices, and the means multiplied by the covariance matrices of the Gaussian mixture model are obtained; together these constitute a trained Gaussian mixture model.
  • the first verification module 103 is configured to calculate a spatial distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user, perform identity verification on the user based on the distance, and generate a verification result.
  • the spatial distance of this embodiment is a cosine distance, which uses the cosine of the angle between two vectors in the vector space as a measure of the magnitude of the difference between two individuals.
  • the standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance, and the standard voiceprint discriminant vector carries the identifier information of the corresponding user when stored, which can accurately represent the identity of the corresponding user.
  • the stored voiceprint discrimination vector is obtained according to the identification information provided by the user before calculating the spatial distance.
  • if the spatial distance is within a preset distance threshold, the verification passes; otherwise, the verification fails.
  • the first acquiring module 101 is specifically configured to: perform pre-emphasis, framing, and windowing on the voice data; perform a Fourier transform on each windowed frame to obtain the corresponding spectrum; input the spectrum into a Mel filter bank to output a Mel spectrum; and perform cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), composing a corresponding voiceprint feature vector based on the MFCC.
  • the pre-emphasis processing is actually a high-pass filtering process, filtering out the low-frequency data, so that the high-frequency characteristics in the speech data are more prominent.
  • the voice data is divided into N frames of short-time signals; there is an overlapping area between adjacent frames, generally 1/2 of the frame length. After framing, each frame signal is regarded as a stationary signal. Because the start and end of each frame are discontinuous, framing introduces a deviation from the original speech; therefore, the voice data also needs to be windowed.
  • the cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform. The inverse transform is generally implemented by the discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
  • the Mel frequency cepstrum coefficient MFCC is the voiceprint feature of the speech data of this frame, and the Mel frequency cepstral coefficient MFCC of each frame is composed into a feature data matrix, which is the voiceprint feature vector of the speech data.
  • the first verification module 103 is specifically configured to calculate the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: cos θ = (A · B) / (‖A‖ ‖B‖), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generate verification-pass information; if the cosine distance is greater than the preset distance threshold, generate verification-failure information.
  • the first verification module is replaced by a second verification module, configured to calculate the current voiceprint discrimination vector and pre-stored standard voiceprint identification.
  • the spatial distance between the vectors, the minimum spatial distance is obtained, the user is authenticated based on the minimum spatial distance, and a verification result is generated.
  • the present embodiment does not carry the identification information of the user when storing the standard voiceprint authentication vector, and calculates the current voiceprint authentication vector and the pre-stored standard when verifying the identity of the user.
  • the voiceprint discriminates the spatial distance between the vectors and obtains a minimum spatial distance. If the minimum spatial distance is less than a preset distance threshold (the distance threshold is the same as or different from the distance threshold of the above embodiment), the verification passes, otherwise verification failed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Collating Specific Patterns (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a voiceprint recognition-based identity verification method, an electronic device, and a storage medium. The voiceprint recognition-based identity verification method comprises: after receiving voice data of a user undergoing identity verification, obtaining voiceprint features of the voice data, and constructing a corresponding voiceprint feature vector on the basis of the voiceprint features; inputting the voiceprint feature vector into a background channel model generated by training in advance, to construct a current voiceprint identification vector corresponding to the voice data; and calculating the spatial distance between the current voiceprint identification vector and a pre-stored standard voiceprint identification vector of the user, performing identity verification on the user on the basis of the distance, and generating a verification result. The present invention can improve the accuracy and efficiency of user identity verification.

Description

Voiceprint Recognition-Based Identity Verification Method, Electronic Device, and Storage Medium
Priority Claim
This application claims priority under the Paris Convention to Chinese patent application No. CN201710147695X, entitled "Voiceprint Recognition-Based Identity Verification Method and System" and filed on March 13, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of communications technologies, and in particular to a voiceprint recognition-based identity verification method, an electronic device, and a storage medium.
Background
At present, the business of large financial companies spans insurance, banking, investment, and other areas, and each of these usually requires communication with customers in a variety of ways (for example, by telephone or face to face). Verifying the customer's identity before such communication takes place is an important part of ensuring business security. To meet real-time business requirements, financial companies usually analyze and verify customer identities manually. Given the large customer base, relying on manual discriminant analysis to verify customer identities is neither accurate nor efficient.
Summary of the Invention
An object of the present invention is to provide a voiceprint recognition-based identity verification method, an electronic device, and a storage medium, with the aim of improving the accuracy and efficiency of user identity verification.
A first aspect of the present invention provides a voiceprint recognition-based identity verification method, comprising:
S1: after receiving voice data of a user undergoing identity verification, acquiring voiceprint features of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint features;
S2: inputting the voiceprint feature vector into a background channel model generated by training in advance, to construct a current voiceprint identification vector corresponding to the voice data;
S3: calculating the spatial distance between the current voiceprint identification vector and a pre-stored standard voiceprint identification vector of the user, verifying the user's identity based on the distance, and generating a verification result.
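Taken together, steps S1-S3 amount to a feature-to-vector-to-distance pipeline. The sketch below is a minimal illustration, not the patent's implementation: `extract_ivector` stands in for the background channel model of step S2, the name `verify_identity` and the 0.5 threshold are hypothetical, and the convention that the cosine distance equals one minus the cosine similarity (smaller meaning more similar) is an assumption consistent with "verification passes when the distance is within the threshold".

```python
import numpy as np

def verify_identity(feature_matrix, extract_ivector, standard_ivector,
                    distance_threshold=0.5):
    """S2 + S3: map the voiceprint features to a current identification
    vector, then compare it with the user's stored standard vector."""
    current = extract_ivector(feature_matrix)                  # step S2
    cos_sim = (current @ standard_ivector
               / (np.linalg.norm(current) * np.linalg.norm(standard_ivector)))
    cosine_distance = 1.0 - cos_sim                            # step S3
    return bool(cosine_distance <= distance_threshold)         # pass / fail
```

Under this convention, identical vectors give distance 0 (pass) and orthogonal vectors give distance 1 (fail).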
A second aspect of the present invention provides an electronic device comprising a processing device, a storage device, and a voiceprint recognition-based identity verification system stored in the storage device; the system includes at least one computer-readable instruction executable by the processing device to implement the following operations:
S1: after receiving voice data of a user undergoing identity verification, acquiring voiceprint features of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint features;
S2: inputting the voiceprint feature vector into a background channel model generated by training in advance, to construct a current voiceprint identification vector corresponding to the voice data;
S3: calculating the spatial distance between the current voiceprint identification vector and a pre-stored standard voiceprint identification vector of the user, verifying the user's identity based on the distance, and generating a verification result.
A third aspect of the present invention provides a computer-readable storage medium storing at least one computer-readable instruction executable by a processing device to implement the following operations:
S1: after receiving voice data of a user undergoing identity verification, acquiring voiceprint features of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint features;
S2: inputting the voiceprint feature vector into a background channel model generated by training in advance, to construct a current voiceprint identification vector corresponding to the voice data;
S3: calculating the spatial distance between the current voiceprint identification vector and a pre-stored standard voiceprint identification vector of the user, verifying the user's identity based on the distance, and generating a verification result.
The present invention is advantageous in that the background channel model generated by training in advance is obtained by mining and comparing a large amount of voice data. This model can accurately characterize the background voiceprint features present when a user speaks while preserving the user's own voiceprint features as far as possible, and can remove the background features at recognition time so that the intrinsic features of the user's voice are extracted, substantially improving both the accuracy and the efficiency of user identity verification.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the application environment of a preferred embodiment of the voiceprint recognition-based identity verification method of the present invention;
FIG. 2 is a schematic flowchart of a preferred embodiment of the voiceprint recognition-based identity verification method of the present invention;
FIG. 3 is a schematic diagram of the refined flow of step S1 shown in FIG. 2;
FIG. 4 is a schematic diagram of the refined flow of step S3 shown in FIG. 2;
FIG. 5 is a schematic structural diagram of a preferred embodiment of the voiceprint recognition-based identity verification system of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the accompanying drawings; the examples given are intended only to explain the invention, not to limit its scope.
Referring to FIG. 1, which is a schematic diagram of the application environment of a preferred embodiment of the voiceprint recognition-based identity verification method of the present invention, the application environment includes an electronic device 1 and a terminal device 2. The electronic device 1 can exchange data with the terminal device 2 through a suitable technology such as a network or near-field communication.
The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice-control device, or the like, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), or a smart wearable device.
The electronic device 1 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, cloud computing being a form of distributed computing in which a super virtual computer is composed of a group of loosely coupled computers.
In this embodiment, the electronic device 1 includes, but is not limited to, a storage device 11, a processing device 12, and a network interface 13 that are communicably connected to one another through a system bus. It should be noted that FIG. 1 shows only an electronic device 1 having components 11-13, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
The storage device 11 includes a memory and at least one type of readable storage medium. The memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, for example a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1. In this embodiment, the readable storage medium of the storage device 11 is generally used to store the operating system and various application software installed on the electronic device 1, for example the program code of the voiceprint recognition-based identity verification system 10 of an embodiment of the present application. The storage device 11 may also be used to temporarily store various types of data that have been or are to be output.
In some embodiments, the processing device 12 may include one or more microprocessors, microcontrollers, digital processors, and the like. The processing device 12 is generally used to control the operation of the electronic device 1, for example to perform control and processing related to data exchange or communication with the terminal device 2. In this embodiment, the processing device 12 is used to run the program code stored in the storage device 11 or to process data, for example to run the voiceprint recognition-based identity verification system 10.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2 and to establish data transmission channels and communication connections between the electronic device 1 and the one or more terminal devices 2.
The voiceprint recognition-based identity verification system 10 includes at least one computer-readable instruction stored in the storage device 11, and the at least one computer-readable instruction is executable by the processing device 12 to implement the voiceprint recognition-based identity verification method of the embodiments of the present application. As described later, the at least one computer-readable instruction can be divided into different logic modules according to the functions implemented by its respective parts.
In an embodiment, when the voiceprint recognition-based identity verification system 10 is executed by the processing device 12, the following operations are implemented: first, after voice data of a user undergoing identity verification is received, voiceprint features of the voice data are acquired, and a corresponding voiceprint feature vector is constructed based on the voiceprint features; the voiceprint feature vector is then input into a background channel model generated by training in advance, to construct a current voiceprint identification vector corresponding to the voice data; finally, the spatial distance between the current voiceprint identification vector and a pre-stored standard voiceprint identification vector of the user is calculated, the user's identity is verified based on the distance, and a verification result is generated.
As shown in FIG. 2, which is a schematic flowchart of a preferred embodiment of the voiceprint recognition-based identity verification method of the present invention, the method is not limited to the steps shown in the flow; moreover, among the steps shown in the flowchart, some steps may be omitted and the order of the steps may be changed. The voiceprint recognition-based identity verification method includes the following steps:
Step S1: after receiving voice data of a user undergoing identity verification, acquire voiceprint features of the voice data, and construct a corresponding voiceprint feature vector based on the voiceprint features.
In this embodiment, the voice data is collected by a voice collection device (for example, a microphone), which sends the collected voice data to the voiceprint recognition-based identity verification system.
When collecting voice data, environmental noise and interference from the voice collection device should be prevented as far as possible. The voice collection device should be kept at an appropriate distance from the user, and a voice collection device with high distortion should be avoided; the power supply is preferably mains power with a stable current, and a sensor should be used when recording telephone calls. Before the voiceprint features are extracted, the voice data may be denoised to further reduce interference. To ensure that voiceprint features can be extracted, the collected voice data has a preset data length, or a length greater than the preset data length.
Voiceprint features come in various types, for example wideband voiceprints, narrowband voiceprints, and amplitude voiceprints; the voiceprint features of this embodiment are preferably the Mel-frequency cepstral coefficients (MFCC) of the voice data. When the corresponding voiceprint feature vector is constructed, the voiceprint features of the voice data are assembled into a feature data matrix, which is the voiceprint feature vector of the voice data.
Step S2: input the voiceprint feature vector into the background channel model generated by training in advance, to construct the current voiceprint identification vector corresponding to the voice data.
The voiceprint feature vector is input into the background channel model generated by training in advance; preferably, the background channel model is a Gaussian mixture model, which is used to compute on the voiceprint feature vector to obtain the corresponding current voiceprint identification vector (i.e., the i-vector).
Specifically, the computation process includes:
1) Selecting Gaussian components: first, the parameters of the universal background channel model are used to compute the log-likelihood of each frame of data under the different Gaussian components; by sorting each column of the log-likelihood matrix in parallel, the top N Gaussian components are selected, finally yielding a matrix of the values of each frame of data under the Gaussian mixture model:
Loglike = E(X) · D(X)^(-1) · X^T - 0.5 · D(X)^(-1) · (X.^2)^T,
where Loglike is the log-likelihood matrix, E(X) is the mean matrix trained in the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 denotes squaring each entry of the matrix.
2) Computing posterior probabilities: X·X^T is computed for each frame of data X to obtain a symmetric matrix, which can be reduced to a lower triangular matrix whose elements are arranged in order into one row, so that the data becomes a vector whose dimension is the number of elements of that lower triangular matrix for each of the N frames; the vectors of all frames are combined into a new data matrix. At the same time, each covariance matrix used for probability computation in the universal background model is likewise reduced to a lower triangular matrix, becoming a matrix similar to the new data matrix. The log-likelihood of each frame of data under the selected Gaussian components is then computed from the mean matrix and covariance matrix of the universal background channel model, followed by a Softmax regression and finally a normalization, yielding the posterior probability distribution of each frame under the Gaussian mixture model; the probability distribution vectors of all frames form the probability matrix.
3) Extracting the current voiceprint identification vector: first, the first-order and second-order coefficients are computed; the first-order coefficients can be obtained by summing the columns of the probability matrix:
Gamma_i = sum_j Loglike_ji,
where Gamma_i is the i-th element of the first-order coefficient vector and Loglike_ji is the element in row j, column i of the probability matrix.
The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix:
X = Loglike^T · feats, where X is the second-order coefficient matrix, Loglike is the probability matrix, and feats is the feature data matrix.
After the first-order and second-order coefficients have been computed, the linear and quadratic terms are computed in parallel, and the current voiceprint identification vector is then computed from the linear and quadratic terms.
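As a concrete illustration of steps 1)-3), the sketch below computes per-frame values under a diagonal-covariance mixture (dropping the constant terms, exactly as the Loglike formula above does), keeps the top-N components per frame with a Softmax and normalization, and then forms the first-order coefficients as column sums and the second-order coefficient matrix as the transposed probability matrix times the data. The function name, array shapes, and top-N value are illustrative assumptions, not the patent's code.

```python
import numpy as np

def frame_posteriors_and_stats(feats, means, variances, top_n=5):
    """Sketch of the i-vector statistics computation described in the text."""
    # Step 1: Loglike = E(X)*D(X)^-1*X^T - 0.5*D(X)^-1*(X.^2)^T, with
    # constants dropped; here rows are frames and columns are components.
    loglike = (feats @ (means / variances).T
               - 0.5 * (feats ** 2) @ (1.0 / variances).T)
    # Step 2: keep only the top-N Gaussian components per frame, apply a
    # stable Softmax over the selected entries, and renormalize.
    post = np.zeros_like(loglike)
    for i, row in enumerate(loglike):
        top = np.argsort(row)[-top_n:]
        w = np.exp(row[top] - row[top].max())
        post[i, top] = w / w.sum()
    # Step 3: first-order coefficients Gamma = column sums of the probability
    # matrix; second-order coefficient matrix = probability matrix^T * feats.
    gamma = post.sum(axis=0)          # shape (K,)
    second = post.T @ feats           # shape (K, D)
    return post, gamma, second
```

Because each posterior row sums to one, the first-order coefficients always sum to the number of frames, which is a useful sanity check.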
Preferably, the background channel model is a Gaussian mixture model, and the following is performed before step S1 above:
acquiring a preset number of voice data samples, acquiring the voiceprint features corresponding to each voice data sample, and constructing the voiceprint feature vector corresponding to each voice data sample based on its voiceprint features;
dividing the voiceprint feature vectors corresponding to the voice data samples into a training set of a first proportion and a validation set of a second proportion, the sum of the first proportion and the second proportion being less than or equal to 1;
training the Gaussian mixture model with the voiceprint feature vectors in the training set and, after the training is completed, verifying the accuracy of the trained Gaussian mixture model with the validation set;
if the accuracy is greater than a preset threshold, ending the model training and using the trained Gaussian mixture model as the background channel model of step S2; or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining based on the increased voice data samples.
When the Gaussian mixture model is trained with the voiceprint feature vectors in the training set, the likelihood of the extracted D-dimensional voiceprint features can be expressed with K Gaussian components as:
P(x) = sum_{k=1..K} w_k · p(x|k),
where P(x) is the probability that a voice data sample is generated by the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability that the sample is generated by the k-th Gaussian component, and K is the number of Gaussian components.
The parameters of the entire Gaussian mixture model can be expressed as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i its mean, and Σ_i its covariance. The Gaussian mixture model can be trained with the unsupervised EM algorithm. After training, the weight vector, constant vector, N covariance matrices, and the matrix of means multiplied by covariances of the Gaussian mixture model are obtained, constituting a trained Gaussian mixture model.
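The unsupervised EM training mentioned above can be sketched with a minimal diagonal-covariance implementation. This is only meant to show the E-step/M-step estimation of the parameters {w_i, μ_i, Σ_i}; a production universal background model would use far more data, more components, and a hardened library trainer, and the function name and regularization constants below are assumptions.

```python
import numpy as np

def train_diag_gmm(X, K, n_iter=50, seed=0):
    """Minimal unsupervised EM for a diagonal-covariance GMM (UBM sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)                        # weights w_k
    mu = X[rng.choice(N, K, replace=False)]        # means mu_k (random init)
    var = np.tile(X.var(axis=0), (K, 1)) + 1e-6    # diagonal covariances
    for _ in range(n_iter):
        # E-step: responsibilities p(k|x) via log-domain Gaussian densities
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * (((X[:, None, :] - mu) ** 2) / var).sum(axis=2))
        logp -= logp.max(axis=1, keepdims=True)    # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)          # (N, K)
        # M-step: re-estimate {w_k, mu_k, var_k} from the responsibilities
        nk = r.sum(axis=0) + 1e-10
        w = nk / N
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

Each iteration alternates soft assignment of frames to components with re-estimation of the component parameters, which is exactly the EM scheme the text refers to.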
Step S3: calculate the spatial distance between the current voiceprint identification vector and the pre-stored standard voiceprint identification vector of the user, verify the user's identity based on the distance, and generate a verification result.
There are various kinds of distances between vectors, including the cosine distance and the Euclidean distance; preferably, the spatial distance in this embodiment is the cosine distance, which uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.
The standard voiceprint identification vector is a voiceprint identification vector obtained and stored in advance; it carries the identification information of the corresponding user when stored, so that it can accurately represent that user's identity. Before the spatial distance is calculated, the stored voiceprint identification vector is retrieved according to the identification information provided by the user.
When the calculated spatial distance is less than or equal to the preset distance threshold, the verification passes; otherwise, the verification fails.
Compared with the prior art, the background channel model generated by training in advance in this embodiment is obtained by mining and comparing a large amount of voice data. This model can accurately characterize the background voiceprint features present when a user speaks while preserving the user's own voiceprint features as far as possible, and can remove the background features at recognition time so that the intrinsic features of the user's voice are extracted, substantially improving the accuracy and efficiency of user identity verification. In addition, this embodiment makes full use of the voiceprint features of the human voice that are related to the vocal tract; such voiceprint features impose no restriction on the spoken text, allowing greater flexibility in recognition and verification.
In a preferred embodiment, as shown in FIG. 3, based on the embodiment of FIG. 2 above, step S1 includes:
Step S11: perform pre-emphasis, framing, and windowing on the voice data. In this embodiment, the voice data is processed after the voice data of the user undergoing identity verification is received. The pre-emphasis is in fact high-pass filtering, which filters out low-frequency data so that the high-frequency characteristics of the voice data stand out more; specifically, the transfer function of the high-pass filter is H(Z) = 1 - αZ^(-1), where Z is the voice data and α is a constant coefficient, preferably α = 0.97. Because a sound signal is stationary only over short periods, the signal is divided into N short-time segments (i.e., N frames), and, to avoid losing the continuity of the sound, adjacent frames share a repeat (overlap) region that is generally half the frame length. After framing, each frame signal is treated as a stationary signal; however, because of the Gibbs effect, the start and end of each frame are discontinuous, so framing alone deviates further from the original speech, and the voice data therefore also needs to be windowed.
Step S12: perform a Fourier transform on each windowed frame to obtain the corresponding spectrum.
Step S13: input the spectrum into a Mel filter bank to output the Mel spectrum.
Step S14: perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and form the corresponding voiceprint feature vector based on the MFCC. The cepstral analysis consists, for example, of taking a logarithm and performing an inverse transform; the inverse transform is generally implemented as a discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients. The MFCC coefficients are the voiceprint feature of each frame of voice data; the MFCC coefficients of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice data.
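Steps S11-S14 can be sketched end to end as follows. The pre-emphasis filter H(Z) = 1 - αZ^(-1) with α = 0.97, the half-frame overlap, the windowing, and keeping the 2nd-13th DCT coefficients follow the text above; the Hamming window, frame length, sampling rate, 26-filter Mel bank, and the function name are illustrative assumptions.

```python
import numpy as np

def mfcc_matrix(signal, sr=16000, frame_len=400, alpha=0.97,
                n_mels=26, n_ceps=12):
    """Sketch of steps S11-S14: signal -> per-frame MFCC feature matrix."""
    # S11a: pre-emphasis, the high-pass filter H(z) = 1 - alpha * z^-1
    emph = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # S11b: framing with an overlap region of half the frame length
    hop = frame_len // 2
    n_frames = max(1, 1 + (len(emph) - frame_len) // hop)
    frames = np.stack([emph[i*hop:i*hop+frame_len] for i in range(n_frames)])
    # S11c: windowing to soften frame-edge discontinuities (Gibbs effect)
    frames *= np.hamming(frame_len)
    # S12: power spectrum via the FFT of each windowed frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # S13: triangular Mel filter bank applied to the spectrum
    mel_pts = np.linspace(0.0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((frame_len + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    mel_spec = np.log(spec @ fbank.T + 1e-10)
    # S14: cepstral analysis via DCT-II; keep the 2nd-13th coefficients
    n = np.arange(n_mels)
    dct_basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * n + 1)
                       / (2 * n_mels))
    return mel_spec @ dct_basis.T      # feature data matrix (n_frames, 12)
```

The returned matrix is exactly the "feature data matrix" of the text: one row of 12 MFCC coefficients per frame.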
在一优选的实施例中,如图4所示,在上述图2的实施例的基础上,上述步骤S3包括:In a preferred embodiment, as shown in FIG. 4, based on the embodiment of FIG. 2 above, the above step S3 includes:
步骤S31,计算所述当前声纹鉴别向量与预存的该用户的标准声纹鉴别向量之间的余弦距离:

cos(A, B) = (A·B) / (‖A‖ ‖B‖)

其中,A为所述标准声纹鉴别向量,B为当前声纹鉴别向量;Step S31, calculate the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: cos(A, B) = (A·B) / (‖A‖ ‖B‖), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector;
步骤S32,若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;Step S32, if the cosine distance is less than or equal to a preset distance threshold, generating verification pass information;
步骤S33,若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。Step S33: If the cosine distance is greater than a preset distance threshold, generate information that the verification fails.
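Steps S31 to S33 amount to a single threshold comparison. A minimal sketch, assuming the cosine distance is taken as 1 minus the cosine similarity (one common convention) and using an illustrative threshold of 0.4 (the patent leaves the threshold preset but unspecified):

```python
import numpy as np

def verify(current_vec, standard_vec, threshold=0.4):
    """Steps S31-S33: pass when the cosine distance between the current
    and stored voiceprint discrimination vectors is within the preset
    threshold. distance = 1 - cosine similarity is an assumed convention."""
    cos_sim = np.dot(current_vec, standard_vec) / (
        np.linalg.norm(current_vec) * np.linalg.norm(standard_vec))
    distance = 1.0 - cos_sim        # smaller = more similar
    return "pass" if distance <= threshold else "fail"
```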
在一优选的实施例中,在上述图2的实施例的基础上,上述的步骤S3替换为:计算所述当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,获取最小的空间距离,基于所述最小的空间距离对该用户进行身份验证,并生成验证结果。In a preferred embodiment, based on the embodiment of FIG. 2 above, step S3 is replaced by: calculating the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector, obtaining the smallest spatial distance, authenticating the user based on the smallest spatial distance, and generating a verification result.
本实施例与图1的实施例不同的是,本实施例在存储标准声纹鉴别向量时并不携带用户的标识信息,在验证用户的身份时,计算当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,并取得最小的空间距离,如果该最小的空间距离小于预设的距离阈值(该距离阈值与上述实施例的距离阈值相同或者不同),则验证通过,否则验证失败。This embodiment differs from the embodiment of FIG. 1 in that the stored standard voiceprint discrimination vectors do not carry user identification information. When verifying the user's identity, the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector is calculated and the smallest spatial distance is taken; if this smallest spatial distance is less than a preset distance threshold (which may be the same as or different from the distance threshold of the above embodiment), the verification passes; otherwise, the verification fails.
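The 1:N variant described above can be sketched the same way: compute the distance to every stored standard voiceprint discrimination vector and pass only when the minimum falls below the preset threshold. Cosine distance is assumed here, and the 0.4 threshold is illustrative; the patent also allows other spatial distances such as the Euclidean distance.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; one common convention for cosine distance."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(current_vec, stored_vecs, threshold=0.4):
    """1:N variant: no user ID is supplied, so the distance to every stored
    standard vector is computed and the minimum is compared against the
    preset threshold."""
    min_dist = min(cosine_distance(current_vec, s) for s in stored_vecs)
    return bool(min_dist < threshold)
```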
请参阅图5,是本发明基于声纹识别的身份验证的系统10较佳实施例的功能模块图。在本实施例中,基于声纹识别的身份验证的系统10可以被分割成一个或多个模块,一个或者多个模块被存储于存储器中,并由一个或多个处理器所执行,以完成本发明。例如,在图5中,基于声纹识别的身份验证的系统10可以被分割成侦测模块21、识别模块22、复制模块23、安装模块24及启动模块25。本发明所称的模块是指能够完成特定功能的一系列计算机程序指令段,比程序更适合于描述基于声纹识别的身份验证的系统10在电子装置中的执行过程,其中:Please refer to FIG. 5, which is a functional block diagram of a preferred embodiment of the voiceprint recognition based identity verification system 10 of the present invention. In this embodiment, the voiceprint recognition based identity verification system 10 may be divided into one or more modules, which are stored in a memory and executed by one or more processors to implement the present invention. For example, in FIG. 5, the voiceprint recognition based identity verification system 10 may be divided into a detection module 21, an identification module 22, a replication module 23, an installation module 24, and a startup module 25. The term "module" as used in the present invention refers to a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program for describing the execution of the voiceprint recognition based identity verification system 10 in an electronic device, wherein:
第一获取模块101,用于在接收到进行身份验证的用户的语音数据后,获取所述语音数据的声纹特征,并基于所述声纹特征构建对应的声纹特征向量;The first obtaining module 101 is configured to acquire a voiceprint feature of the voice data after receiving the voice data of the user who performs the identity verification, and construct a corresponding voiceprint feature vector based on the voiceprint feature;
本实施例中,语音数据由语音采集设备采集得到(语音采集设备例如为麦克风),语音采集设备将采集的语音数据发送给基于声纹识别的身份验证的系统。In this embodiment, the voice data is collected by the voice collection device (the voice collection device is, for example, a microphone), and the voice collection device sends the collected voice data to the voice recognition-based identity verification system.
在采集语音数据时,应尽量防止环境噪声和语音采集设备的干扰。语音采集设备与用户保持适当距离,且尽量不用失真大的语音采集设备,电源优选使用市电,并保持电流稳定;在进行电话录音时应使用传感器。在提取语音数据中的声纹特征之前,可以对语音数据进行去噪音处理,以进一步减少干扰。为了能够提取得到语音数据的声纹特征,所采集的语音数据为预设数据长度的语音数据,或者为大于预设数据长度的语音数据。When collecting voice data, interference from environmental noise and from the voice collection device should be avoided as far as possible. The voice collection device should be kept at an appropriate distance from the user, a device with large distortion should be avoided, mains power is preferably used with the current kept stable, and a sensor should be used when recording telephone calls. Before extracting the voiceprint features, the voice data may be denoised to further reduce interference. So that voiceprint features can be extracted, the collected voice data has the preset data length, or is longer than the preset data length.
声纹特征包括多种类型,例如宽带声纹、窄带声纹、振幅声纹等,本实施例的声纹特征优选地为语音数据的梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)。在构建对应的声纹特征向量时,将语音数据的声纹特征组成特征数据矩阵,该特征数据矩阵即为语音数据的声纹特征向量。Voiceprint features come in various types, such as wide-band voiceprints, narrow-band voiceprints, and amplitude voiceprints; the voiceprint feature of this embodiment is preferably the Mel Frequency Cepstrum Coefficient (MFCC) of the voice data. When constructing the corresponding voiceprint feature vector, the voiceprint features of the voice data are assembled into a feature data matrix, which is the voiceprint feature vector of the voice data.
构建模块102,用于将所述声纹特征向量输入预先训练生成的背景信道模型,以构建出所述语音数据对应的当前声纹鉴别向量;The constructing module 102 is configured to input the voiceprint feature vector into a background channel model generated by pre-training to construct a current voiceprint discrimination vector corresponding to the voice data;
其中,将声纹特征向量输入预先训练生成的背景信道模型,优选地,该背景信道模型为高斯混合模型,利用该背景信道模型来计算声纹特征向量,得出对应的当前声纹鉴别向量(即i-vector)。The voiceprint feature vector is input into the background channel model generated by pre-training; preferably, the background channel model is a Gaussian mixture model, which is used to compute on the voiceprint feature vector and obtain the corresponding current voiceprint discrimination vector (i.e., the i-vector).
具体地,该计算过程包括:Specifically, the calculation process includes:
1)、选择高斯模型:首先,利用通用背景信道模型中的参数来计算每帧数据在不同高斯模型的似然对数值,通过对似然对数值矩阵每列并行排序,选取前N个高斯模型,最终获得一每帧数据在混合高斯模型中数值的矩阵:1) Select the Gaussian model: First, use the parameters in the general background channel model to calculate the likelihood value of each frame of data in different Gaussian models. By sorting the columns of the likelihood logarithmic matrix in parallel, select the first N Gaussian models. Finally, a matrix of values per frame of data in the mixed Gaussian model is obtained:
Loglike=E(X)*D(X)-1*XT-0.5*D(X)-1*(X.2)TLoglike=E(X)*D(X) -1 *X T -0.5*D(X) -1 *(X. 2 ) T ,
其中,Loglike为似然对数值矩阵,E(X)为通用背景信道模型训练出来的均值矩阵,D(X)为协方差矩阵,X为数据矩阵,X.2为矩阵每个值取平方。Among them, Loglike is a likelihood logarithmic matrix, E(X) is a mean matrix trained by a general background channel model, D(X) is a covariance matrix, X is a data matrix, and X. 2 is a square of each value of the matrix.
2)、计算后验概率:将每帧数据X进行X*X^T计算,得到一个对称矩阵,可简化为下三角矩阵,并将元素按顺序排列为1行,变成一个N帧乘以该下三角矩阵元素个数纬度的一个向量进行计算,将所有帧的该向量组合成新的数据矩阵,同时将通用背景模型中计算概率的协方差矩阵,每个矩阵也简化为下三角矩阵,变成与新数据矩阵类似的矩阵,再通过通用背景信道模型中的均值矩阵和协方差矩阵算出每帧数据在该选择的高斯模型下的似然对数值,然后进行Softmax回归,最后进行归一化操作,得到每帧在混合高斯模型下的后验概率分布,将每帧的概率分布向量组成概率矩阵。2) Compute posterior probabilities: for each frame of data X, compute X*X^T to obtain a symmetric matrix, which can be simplified to its lower triangular part; its elements are arranged in order into one row, giving, over the N frames, a matrix whose second dimension is the number of lower-triangular elements. The vectors of all frames are combined into this new data matrix. The covariance matrices used for probability computation in the universal background model are likewise each simplified to a lower triangular matrix, forming a matrix similar to the new data matrix. The log-likelihood of each frame under the selected Gaussian components is then computed from the mean matrix and covariance matrix of the universal background channel model, followed by softmax regression and finally normalization, yielding each frame's posterior probability distribution over the Gaussian mixture model; the per-frame probability distribution vectors form the probability matrix.
3)、提取当前声纹鉴别向量:首先进行一阶,二阶系数的计算,一阶系数计算可以通过概率矩阵列求和得到:3) Extract the current voiceprint discrimination vector: firstly calculate the first-order and second-order coefficients, and the first-order coefficient calculation can be obtained by summing the probability matrix:
Gamma_i = Σ_j loglikes_ji

其中,Gamma_i为一阶系数向量的第i个元素,loglikes_ji为概率矩阵第j行的第i个元素。Among them, Gamma_i is the i-th element of the first-order coefficient vector, and loglikes_ji is the i-th element of the j-th row of the probability matrix.
二阶系数可以通过概率矩阵的转置乘以数据矩阵获得:The second-order coefficients can be obtained by multiplying the transposition of the probability matrix by the data matrix:
X=LoglikeT*feats,其中,X为二阶系数矩阵,loglike为概率矩阵,feats为特征数据矩阵。X=Loglike T *feats, where X is a second-order coefficient matrix, loglike is a probability matrix, and feats is a feature data matrix.
在计算得到一阶,二阶系数以后,并行计算一次项和二次项,然后通过一次项和二次项计算当前声纹鉴别向量。After the first-order and second-order coefficients are calculated, the primary term and the quadratic term are calculated in parallel, and then the current voiceprint discrimination vector is calculated by the primary term and the quadratic term.
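Steps 1) through 3) above can be sketched in matrix form: the frame-dependent part of the per-Gaussian log-likelihoods, top-N Gaussian selection, and the first-/second-order coefficients Gamma_i = Σ_j loglikes_ji and X = Loglike^T · feats. Diagonal covariances are assumed, and constant terms of the Gaussian density are omitted, as in the Loglike formula above.

```python
import numpy as np

def loglike_matrix(X, means, variances):
    """Frame-dependent part of the diagonal-covariance Gaussian log-likelihoods:
    Loglike = E(X)*D(X)^-1*X^T - 0.5*D(X)^-1*(X.^2)^T (constants omitted)."""
    inv_var = 1.0 / variances                     # D(X)^-1, shape (K, D)
    ll = (means * inv_var) @ X.T - 0.5 * inv_var @ (X ** 2).T
    return ll.T                                   # (N frames, K Gaussians)

def top_n_gaussians(loglikes, n=5):
    """Step 1): keep the indices of the N best-scoring Gaussians per frame."""
    return np.argsort(loglikes, axis=1)[:, ::-1][:, :n]

def sufficient_stats(posteriors, feats):
    """Step 3): 'first-order' coefficients Gamma_i = sum_j loglikes_ji
    (column sums of the per-frame posterior matrix) and 'second-order'
    coefficients X = Loglike^T * feats."""
    gamma = posteriors.sum(axis=0)                # (K,)
    second = posteriors.T @ feats                 # (K, D)
    return gamma, second
```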
优选地,背景信道模型为高斯混合模型,基于声纹识别的身份验证的系统还包括:Preferably, the background channel model is a Gaussian mixture model, and the voiceprint recognition based authentication system further comprises:
第二获取模块,用于获取预设数量的语音数据样本,并获取各语音数据样本对应的声纹特征,并基于各语音数据样本对应的声纹特征构建各语音数据样本对应的声纹特征向量;a second acquisition module, configured to acquire a preset number of voice data samples, acquire the voiceprint features corresponding to each voice data sample, and construct the voiceprint feature vector corresponding to each voice data sample based on those voiceprint features;
划分模块,用于将各语音数据样本对应的声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;a dividing module, configured to divide the voiceprint feature vector corresponding to each voice data sample into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
训练模块,用于利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;a training module is configured to train the Gaussian mixture model by using the voiceprint feature vector in the training set, and after the training is completed, verify the accuracy of the trained Gaussian mixture model by using the verification set;
处理模块,用于若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。a processing module, configured to: if the accuracy is greater than a preset threshold, end the model training and use the trained Gaussian mixture model as the background channel model; or, if the accuracy is less than or equal to the preset threshold, increase the number of voice data samples and retrain based on the increased voice data samples.
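The division module's split into a first-proportion training set and a second-proportion validation set, with the two proportions summing to at most 1, might look like the following; the 0.7/0.2 ratios are illustrative assumptions.

```python
import numpy as np

def split_samples(samples, train_ratio=0.7, valid_ratio=0.2, seed=0):
    """Divide per-sample voiceprint feature vectors into a training set of a
    first proportion and a validation set of a second proportion."""
    assert train_ratio + valid_ratio <= 1.0
    idx = np.random.default_rng(seed).permutation(len(samples))
    n_train = int(len(samples) * train_ratio)
    n_valid = int(len(samples) * valid_ratio)
    train = [samples[i] for i in idx[:n_train]]
    valid = [samples[i] for i in idx[n_train:n_train + n_valid]]
    return train, valid
```

If the validation accuracy is at or below the preset threshold, more samples are collected and the split and training are repeated, as the processing module describes.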
其中,在利用训练集中的声纹特征向量对高斯混合模型进行训练时,抽取出来的D维声纹特征对应的似然概率可用K个高斯分量表示为:When the Gaussian mixture model is trained by using the voiceprint feature vector in the training set, the likelihood probability corresponding to the extracted D-dimensional voiceprint feature can be expressed by K Gaussian components:
P(x) = Σ_{k=1}^{K} w_k · p(x|k)

其中,P(x)为语音数据样本由高斯混合模型生成的概率(混合高斯模型),w_k为每个高斯模型的权重,p(x|k)为样本由第k个高斯模型生成的概率,K为高斯模型数量。Where P(x) is the probability that a voice data sample is generated by the Gaussian mixture model, w_k is the weight of each Gaussian component, p(x|k) is the probability that the sample is generated by the k-th Gaussian component, and K is the number of Gaussian components.
整个高斯混合模型的参数可以表示为:{wiii},wi为第i个高斯模型的权重,μi为第i个高斯模型的均值,∑i为第i个高斯模型的协方差。训练该高斯混合模型可以用非监督的EM算法。训练完成后,得到高斯混合模型的权重向量、常数向量、N个协方差矩阵、均值乘以协方差的矩阵等,即为一个训练后的高斯混合模型。The parameters of the entire Gaussian mixture model can be expressed as: {w i , μ i , Σ i }, w i is the weight of the i-th Gaussian model, μ i is the mean of the i-th Gaussian model, and ∑ i is the i-th Gaussian The covariance of the model. Training the Gaussian mixture model can use an unsupervised EM algorithm. After the training is completed, the Gaussian mixture model weight vector, constant vector, N covariance matrix, and the mean multiplied by the covariance matrix are obtained, which is a trained Gaussian mixture model.
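The unsupervised EM training of the Gaussian mixture parameters {w_i, μ_i, Σ_i} described above can be sketched as follows, assuming diagonal covariances. A real universal background model uses far more components, iterations, and data; this is a minimal illustration.

```python
import numpy as np

def train_gmm(X, K=4, iters=20, seed=0):
    """Minimal unsupervised EM for a diagonal-covariance Gaussian mixture.
    Returns the weights w, means mu, and diagonal variances var."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]          # init means from data
    var = np.var(X, axis=0) * np.ones((K, D)) + 1e-6
    for _ in range(iters):
        # E-step: per-frame responsibilities from log densities
        ll = -0.5 * (((X[:, None, :] - mu) ** 2) / var
                     + np.log(2 * np.pi * var)).sum(-1)
        ll += np.log(w)
        ll -= ll.max(axis=1, keepdims=True)          # numerical stability
        resp = np.exp(ll)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0) + 1e-10
        w = nk / N
        mu = (resp.T @ X) / nk[:, None]
        var = (resp.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```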
第一验证模块103,用于计算所述当前声纹鉴别向量与预存的该用户的标准声纹鉴别向量之间的空间距离,基于所述距离对该用户进行身份验证,并生成验证结果。The first verification module 103 is configured to calculate a spatial distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user, perform identity verification on the user based on the distance, and generate a verification result.
向量与向量之间的距离有多种,包括余弦距离及欧氏距离等等,优选地,本实施例的空间距离为余弦距离,余弦距离为利用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小的度量。There are various distances between the vector and the vector, including the cosine distance and the Euclidean distance, etc. Preferably, the spatial distance of the present embodiment is a cosine distance, and the cosine distance is a cosine value of the angle between two vectors in the vector space. A measure of the magnitude of the difference between two individuals.
其中,标准声纹鉴别向量为预先获得并存储的声纹鉴别向量,标准声纹鉴别向量在存储时携带其对应的用户的标识信息,其能够准确代表对应的用户的身份。在计算空间距离前,根据用户提供的标识信息获得存储的声纹鉴别向量。The standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance, and the standard voiceprint discriminant vector carries the identifier information of the corresponding user when stored, which can accurately represent the identity of the corresponding user. The stored voiceprint discrimination vector is obtained according to the identification information provided by the user before calculating the spatial distance.
其中,在计算得到的空间距离小于等于预设距离阈值时,验证通过,反之,则验证失败。Wherein, when the calculated spatial distance is less than or equal to the preset distance threshold, the verification passes, and vice versa, the verification fails.
在一优选的实施例中,在上述图5的实施例的基础上,上述第一获取模块101具体用于对所述语音数据进行预加重、分帧和加窗处理;对每一个加窗进行傅立叶变换得到对应的频谱;将所述频谱输入梅尔滤波器以输出得到梅尔频谱;在梅尔频谱上面进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。In a preferred embodiment, based on the embodiment of FIG. 5 above, the first acquisition module 101 is specifically configured to: perform pre-emphasis, framing, and windowing on the voice data; perform a Fourier transform on each windowed frame to obtain the corresponding spectrum; input the spectrum into a Mel filter to output the Mel spectrum; and perform cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients MFCC, composing the corresponding voiceprint feature vector from the MFCC.
其中,预加重处理实际是高通滤波处理,滤除低频数据,使得语音数据中的高频特性更加突显,具体地,高通滤波的传递函数为:H(Z)=1-αZ-1,其中,Z为语音数据,α为常量系数,优选地,α的取值为0.97;由于声音信号只在较短时间内呈现平稳性,因此将一段声音信号分成N段短时间的信号(即N帧),且为了避免声音的连续性特征丢失,相邻帧之间有一段重复区域,重复区域一般为每帧长的1/2;在对语音数据进行分帧后,每一帧信号都当成平稳信号来处理,但吉布斯效应的存在,语音数据的起始帧和结束帧是不连续的,在分帧之后,更加背离原始语音,因此,需要对语音数据进行加窗处理。The pre-emphasis processing is actually a high-pass filtering process, filtering out the low-frequency data, so that the high-frequency characteristics in the speech data are more prominent. Specifically, the transfer function of the high-pass filter is: H(Z)=1-αZ -1 , wherein Z is voice data, α is a constant coefficient, preferably, the value of α is 0.97; since the sound signal exhibits smoothness only in a short time, a sound signal is divided into N short-time signals (ie, N frames). In order to avoid the loss of the continuity feature of the sound, there is a repeating area between adjacent frames, and the repeating area is generally 1/2 of the length of each frame; after the framed speech data, each frame signal is regarded as a stationary signal. To deal with, but the existence of the Gibbs effect, the start frame and the end frame of the speech data are discontinuous, and after the framing, the original speech is further deviated. Therefore, the voice data needs to be windowed.
其中,倒谱分析例如为取对数、做逆变换,逆变换一般是通过DCT离散余弦变换来实现,取DCT后的第2个到第13个系数作为MFCC系数。梅尔频率倒谱系数MFCC即为这帧语音数据的声纹特征,将每帧的梅尔频率倒谱系数MFCC组成特征数据矩阵,该特征数据矩阵即为语音数据的声纹特征向量。The cepstrum analysis is, for example, taking logarithm and inverse transform. The inverse transform is generally implemented by DCT discrete cosine transform, and the second to thirteenth coefficients after DCT are taken as MFCC coefficients. The Mel frequency cepstrum coefficient MFCC is the voiceprint feature of the speech data of this frame, and the Mel frequency cepstral coefficient MFCC of each frame is composed into a feature data matrix, which is the voiceprint feature vector of the speech data.
在一优选的实施例中,在上述图5的实施例的基础上,所述第一验证模块103具体用于计算所述当前声纹鉴别向量与预存的该用户的标准声纹鉴别向量之间的余弦距离:

cos(A, B) = (A·B) / (‖A‖ ‖B‖)

其中,A为所述标准声纹鉴别向量,B为当前声纹鉴别向量;若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。In a preferred embodiment, based on the embodiment of FIG. 5 above, the first verification module 103 is specifically configured to calculate the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: cos(A, B) = (A·B) / (‖A‖ ‖B‖), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, information that the verification passes is generated; if the cosine distance is greater than the preset distance threshold, information that the verification fails is generated.
在一优选的实施例中,在上述图5的实施例的基础上,上述的第一验证模块替换为第二验证模块,用于计算所述当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,获取最小的空间距离,基于所述最小的空间距离对该用户进行身份验证,并生成验证结果。In a preferred embodiment, based on the embodiment of FIG. 5 above, the first verification module is replaced by a second verification module, configured to calculate the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector, obtain the smallest spatial distance, authenticate the user based on the smallest spatial distance, and generate a verification result.
本实施例与图5的实施例不同的是,本实施例在存储标准声纹鉴别向量时并不携带用户的标识信息,在验证用户的身份时,计算当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,并取得最小的空间距离,如果该最小的空间距离小于预设的距离阈值(该距离阈值与上述实施例的距离阈值相同或者不同),则验证通过,否则验证失败。This embodiment differs from the embodiment of FIG. 5 in that the stored standard voiceprint discrimination vectors do not carry user identification information. When verifying the user's identity, the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector is calculated and the smallest spatial distance is taken; if this smallest spatial distance is less than a preset distance threshold (which may be the same as or different from the distance threshold of the above embodiment), the verification passes; otherwise, the verification fails.
以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。 The above are only the preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalents, improvements, etc., which are within the spirit and scope of the present invention, should be included in the protection of the present invention. Within the scope.

Claims (20)

  1. 一种基于声纹识别的身份验证的方法,其特征在于,所述基于声纹识别的身份验证的方法包括:A method for identity verification based on voiceprint recognition, characterized in that the method for authenticating voiceprint recognition based authentication comprises:
    S1,在接收到进行身份验证的用户的语音数据后,获取所述语音数据的声纹特征,并基于所述声纹特征构建对应的声纹特征向量;S1, after receiving the voice data of the user who performs the authentication, acquiring the voiceprint feature of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint feature;
    S2,将所述声纹特征向量输入预先训练生成的背景信道模型,以构建出所述语音数据对应的当前声纹鉴别向量;S2, input the voiceprint feature vector into a background channel model generated by pre-training to construct a current voiceprint discrimination vector corresponding to the voice data;
    S3,计算所述当前声纹鉴别向量与预存的该用户的标准声纹鉴别向量之间的空间距离,基于所述距离对该用户进行身份验证,并生成验证结果。S3. Calculate a spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the user, perform identity verification on the user based on the distance, and generate a verification result.
  2. 根据权利要求1所述的基于声纹识别的身份验证的方法,其特征在于,所述背景信道模型为高斯混合模型,所述步骤S1之前包括:The voiceprint recognition-based authentication method according to claim 1, wherein the background channel model is a Gaussian mixture model, and the step S1 comprises:
    获取预设数量的语音数据样本,并获取各语音数据样本对应的声纹特征,并基于各语音数据样本对应的声纹特征构建各语音数据样本对应的声纹特征向量;Obtaining a preset number of voice data samples, and acquiring voiceprint features corresponding to each voice data sample, and constructing a voiceprint feature vector corresponding to each voice data sample based on voiceprint features corresponding to each voice data sample;
    将各语音数据样本对应的声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;The voiceprint feature vector corresponding to each voice data sample is divided into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;The Gaussian mixture model is trained by using the voiceprint feature vector in the training set, and after the training is completed, the accuracy of the trained Gaussian mixture model is verified by using the verification set;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述步骤S2的背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。If the accuracy is greater than the preset threshold, the model training ends, and the trained Gaussian mixture model is used as the background channel model of step S2; or, if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and training is performed again based on the increased voice data samples.
  3. 根据权利要求1所述的基于声纹识别的身份验证的方法,其特征在于,所述步骤S3替换为:计算所述当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,获取最小的空间距离,基于所述最小的空间距离对该用户进行身份验证,并生成验证结果。The method according to claim 1, wherein the step S3 is replaced by: calculating a spatial distance between the current voiceprint discrimination vector and each of the pre-stored standard voiceprint discrimination vectors. Obtaining a minimum spatial distance, authenticating the user based on the minimum spatial distance, and generating a verification result.
  4. 根据权利要求1所述的基于声纹识别的身份验证的方法,其特征在于,所述步骤S1包括:The voiceprint recognition based authentication method according to claim 1, wherein the step S1 comprises:
    S11,对所述语音数据进行预加重、分帧和加窗处理;S11: Perform pre-emphasis, framing, and windowing on the voice data.
    S12,对每一个加窗进行傅立叶变换得到对应的频谱;S12, performing Fourier transform on each window to obtain a corresponding spectrum;
    S13,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;S13, input the spectrum into a Mel filter to output the Mel spectrum;
    S14,在梅尔频谱上面进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。S14, performing cepstrum analysis on the Mel spectrum to obtain a Mel frequency cepstral coefficient MFCC, and composing a corresponding voiceprint feature vector based on the Mel frequency cepstral coefficient MFCC.
  5. 根据权利要求4所述的基于声纹识别的身份验证的方法,其特征在于,所述背景信道模型为高斯混合模型,所述步骤S1之前包括:The voiceprint recognition based authentication method according to claim 4, wherein the background channel model is a Gaussian mixture model, and the step S1 comprises:
    获取预设数量的语音数据样本,并获取各语音数据样本对应的声纹特征,并基于各语音数据样本对应的声纹特征构建各语音数据样本对应的声纹特征向量; Obtaining a preset number of voice data samples, and acquiring voiceprint features corresponding to each voice data sample, and constructing a voiceprint feature vector corresponding to each voice data sample based on voiceprint features corresponding to each voice data sample;
    将各语音数据样本对应的声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;The voiceprint feature vector corresponding to each voice data sample is divided into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;The Gaussian mixture model is trained by using the voiceprint feature vector in the training set, and after the training is completed, the accuracy of the trained Gaussian mixture model is verified by using the verification set;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述步骤S2的背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。If the accuracy is greater than the preset threshold, the model training ends, and the trained Gaussian mixture model is used as the background channel model of step S2; or, if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and training is performed again based on the increased voice data samples.
  6. 根据权利要求4所述的基于声纹识别的身份验证的方法,其特征在于,所述步骤S3替换为:计算所述当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,获取最小的空间距离,基于所述最小的空间距离对该用户进行身份验证,并生成验证结果。The method according to claim 4, wherein the step S3 is replaced by: calculating a spatial distance between the current voiceprint discrimination vector and each of the pre-stored standard voiceprint discrimination vectors. Obtaining a minimum spatial distance, authenticating the user based on the minimum spatial distance, and generating a verification result.
  7. 根据权利要求1所述的基于声纹识别的身份验证的方法,其特征在于,所述步骤S3包括:The voiceprint recognition based authentication method according to claim 1, wherein the step S3 comprises:
    S31,计算所述当前声纹鉴别向量与预存的该用户的标准声纹鉴别向量之间的余弦距离:cos(A, B) = (A·B) / (‖A‖ ‖B‖),其中,A为所述标准声纹鉴别向量,B为当前声纹鉴别向量;S31. Calculate the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: cos(A, B) = (A·B) / (‖A‖ ‖B‖), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector;
    S32,若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;S32. If the cosine distance is less than or equal to a preset distance threshold, information that the verification passes is generated;
    S33,若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。S33. If the cosine distance is greater than a preset distance threshold, generate information that the verification fails.
  8. 根据权利要求7所述的基于声纹识别的身份验证的方法,其特征在于,所述背景信道模型为高斯混合模型,所述步骤S1之前包括:The voiceprint recognition-based authentication method according to claim 7, wherein the background channel model is a Gaussian mixture model, and the step S1 comprises:
    获取预设数量的语音数据样本,并获取各语音数据样本对应的声纹特征,并基于各语音数据样本对应的声纹特征构建各语音数据样本对应的声纹特征向量;Obtaining a preset number of voice data samples, and acquiring voiceprint features corresponding to each voice data sample, and constructing a voiceprint feature vector corresponding to each voice data sample based on voiceprint features corresponding to each voice data sample;
    将各语音数据样本对应的声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;The voiceprint feature vector corresponding to each voice data sample is divided into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;The Gaussian mixture model is trained by using the voiceprint feature vector in the training set, and after the training is completed, the accuracy of the trained Gaussian mixture model is verified by using the verification set;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述步骤S2的背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。If the accuracy is greater than the preset threshold, the model training ends, and the trained Gaussian mixture model is used as the background channel model of step S2; or, if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and training is performed again based on the increased voice data samples.
  9. 一种电子装置,其特征在于,包括处理设备、存储设备及基于声纹识别的身份验证的系统,该基于声纹识别的身份验证的系统存储于该存储设备中,包括至少一个计算机可读指令,该至少一个计算机可读指令可被所述处理设备执行,以实现以下操作: An electronic device, comprising: a processing device, a storage device, and a voiceprint recognition based authentication system, wherein the voiceprint recognition based authentication system is stored in the storage device, including at least one computer readable instruction The at least one computer readable instruction is executable by the processing device to:
    S1,在接收到进行身份验证的用户的语音数据后,获取所述语音数据的声纹特征,并基于所述声纹特征构建对应的声纹特征向量;S1, after receiving the voice data of the user who performs the authentication, acquiring the voiceprint feature of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint feature;
    S2,将所述声纹特征向量输入预先训练生成的背景信道模型,以构建出所述语音数据对应的当前声纹鉴别向量;S2, input the voiceprint feature vector into a background channel model generated by pre-training to construct a current voiceprint discrimination vector corresponding to the voice data;
    S3,计算所述当前声纹鉴别向量与预存的该用户的标准声纹鉴别向量之间的空间距离,基于所述距离对该用户进行身份验证,并生成验证结果。S3. Calculate a spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the user, perform identity verification on the user based on the distance, and generate a verification result.
  10. 根据权利要求9所述的电子装置,其特征在于,所述背景信道模型为高斯混合模型,于所述步骤S1之前,所述至少一个计算机可读指令还可被处理设备执行,以实现以下操作:The electronic device according to claim 9, wherein the background channel model is a Gaussian mixture model, and the at least one computer readable instruction is further executable by the processing device to perform the following operations before the step S1 :
    获取预设数量的语音数据样本,并获取各语音数据样本对应的声纹特征,并基于各语音数据样本对应的声纹特征构建各语音数据样本对应的声纹特征向量;Obtaining a preset number of voice data samples, and acquiring voiceprint features corresponding to each voice data sample, and constructing a voiceprint feature vector corresponding to each voice data sample based on voiceprint features corresponding to each voice data sample;
    将各语音数据样本对应的声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;The voiceprint feature vector corresponding to each voice data sample is divided into a training set of a first ratio and a verification set of a second ratio, wherein a sum of the first ratio and the second ratio is less than or equal to 1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;The Gaussian mixture model is trained by using the voiceprint feature vector in the training set, and after the training is completed, the accuracy of the trained Gaussian mixture model is verified by using the verification set;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述步骤S2的背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。If the accuracy is greater than the preset threshold, the model training ends, and the trained Gaussian mixture model is used as the background channel model of step S2; or, if the accuracy is less than or equal to the preset threshold, the number of voice data samples is increased and training is performed again based on the increased voice data samples.
  11. 根据权利要求9所述的电子装置,其特征在于,所述步骤S3替换为:计算所述当前声纹鉴别向量与预存的各标准声纹鉴别向量之间的空间距离,获取最小的空间距离,基于所述最小的空间距离对该用户进行身份验证,并生成验证结果。The electronic device according to claim 9, wherein the step S3 is replaced by: calculating a spatial distance between the current voiceprint discrimination vector and each of the pre-stored standard voiceprint discrimination vectors, and obtaining a minimum spatial distance, The user is authenticated based on the minimum spatial distance and a verification result is generated.
  12. The electronic device according to claim 9, wherein step S1 comprises:
    S11, performing pre-emphasis, framing, and windowing on the voice data;
    S12, performing a Fourier transform on each windowed frame to obtain a corresponding spectrum;
    S13, inputting the spectrum into a Mel filter bank to output a Mel spectrum;
    S14, performing cepstral analysis on the Mel spectrum to obtain Mel-frequency cepstral coefficients (MFCCs), and composing the corresponding voiceprint feature vector from the MFCCs.
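Steps S11–S14 can be sketched as a compact numpy implementation. The frame size, hop, 26-filter Mel bank, and 13 retained coefficients are common illustrative defaults, not values fixed by the claims.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_mfcc=13):
    # S11: pre-emphasis, then split into overlapping frames and window them.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # S12: Fourier transform of each windowed frame -> power spectrum.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # S13: triangular Mel filter bank applied to the power spectrum.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = np.log(power @ fbank.T + 1e-10)

    # S14: cepstral analysis (DCT of the log Mel spectrum), keep n_mfcc coeffs.
    return dct(mel_spec, type=2, axis=1, norm="ortho")[:, :n_mfcc]

features = mfcc(np.random.default_rng(0).normal(size=16000))  # 1 s of noise
```

Each row of `features` is the MFCC vector of one frame; stacking or pooling these rows yields the per-utterance voiceprint feature vector the claim refers to.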
  13. The electronic device according to claim 12, wherein the background channel model is a Gaussian mixture model, and before step S1 the at least one computer-readable instruction is further executable by the processing device to perform the following operations:
    obtaining a preset number of voice data samples, acquiring the voiceprint features corresponding to each voice data sample, and constructing, based on the voiceprint features corresponding to each voice data sample, a voiceprint feature vector corresponding to that sample;
    dividing the voiceprint feature vectors corresponding to the voice data samples into a training set of a first proportion and a validation set of a second proportion, a sum of the first proportion and the second proportion being less than or equal to 1;
    training the Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is complete, verifying the accuracy of the trained Gaussian mixture model with the validation set; and
    if the accuracy is greater than a preset threshold, ending model training and using the trained Gaussian mixture model as the background channel model of step S2, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining based on the increased voice data samples.
  14. The electronic device according to claim 12, wherein step S3 is replaced by: calculating the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector, obtaining the minimum spatial distance, authenticating the user based on the minimum spatial distance, and generating a verification result.
  15. The electronic device according to claim 9, wherein step S3 comprises:
    S31, calculating the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user, where
    Figure PCTCN2017091361-appb-100003
    is the standard voiceprint discrimination vector and
    Figure PCTCN2017091361-appb-100004
    is the current voiceprint discrimination vector;
    S32, if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification passes;
    S33, if the cosine distance is greater than the preset distance threshold, generating information that the verification fails.
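Steps S31–S33 amount to a thresholded cosine comparison. A minimal sketch, in which taking the cosine distance as 1 − cosine similarity and the 0.3 threshold are illustrative assumptions:

```python
import numpy as np

def verify(current, standard, threshold=0.3):
    # S31: cosine distance between the current and stored discrimination vectors.
    cos_sim = np.dot(current, standard) / (
        np.linalg.norm(current) * np.linalg.norm(standard))
    distance = 1.0 - cos_sim
    # S32 / S33: pass iff the distance does not exceed the preset threshold.
    return distance <= threshold

standard = np.array([0.2, 0.9, -0.1, 0.4])
assert verify(standard + 0.01, standard)   # near-identical vector passes
assert not verify(-standard, standard)     # opposite vector fails
```

Because cosine distance ignores vector magnitude, this comparison is insensitive to overall loudness differences between the enrollment and probe utterances, which is one reason it is a common choice for i-vector style speaker scoring.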
  16. The electronic device according to claim 15, wherein the background channel model is a Gaussian mixture model, and before step S1 the at least one computer-readable instruction is further executable by the processing device to perform the following operations:
    obtaining a preset number of voice data samples, acquiring the voiceprint features corresponding to each voice data sample, and constructing, based on the voiceprint features corresponding to each voice data sample, a voiceprint feature vector corresponding to that sample;
    dividing the voiceprint feature vectors corresponding to the voice data samples into a training set of a first proportion and a validation set of a second proportion, a sum of the first proportion and the second proportion being less than or equal to 1;
    training the Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is complete, verifying the accuracy of the trained Gaussian mixture model with the validation set; and
    if the accuracy is greater than a preset threshold, ending model training and using the trained Gaussian mixture model as the background channel model of step S2, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining based on the increased voice data samples.
  17. A computer-readable storage medium having stored thereon at least one computer-readable instruction executable by a processing device to perform the following operations:
    S1, after receiving the voice data of a user undergoing identity verification, acquiring the voiceprint features of the voice data, and constructing a corresponding voiceprint feature vector based on the voiceprint features;
    S2, inputting the voiceprint feature vector into a background channel model generated by pre-training, to construct the current voiceprint discrimination vector corresponding to the voice data;
    S3, calculating the spatial distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user, authenticating the user based on the distance, and generating a verification result.
  18. The storage medium according to claim 17, wherein the background channel model is a Gaussian mixture model, and before step S1 the at least one computer-readable instruction is further executable by the processing device to perform the following operations:
    obtaining a preset number of voice data samples, acquiring the voiceprint features corresponding to each voice data sample, and constructing, based on the voiceprint features corresponding to each voice data sample, a voiceprint feature vector corresponding to that sample;
    dividing the voiceprint feature vectors corresponding to the voice data samples into a training set of a first proportion and a validation set of a second proportion, a sum of the first proportion and the second proportion being less than or equal to 1;
    training the Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is complete, verifying the accuracy of the trained Gaussian mixture model with the validation set; and
    if the accuracy is greater than a preset threshold, ending model training and using the trained Gaussian mixture model as the background channel model of step S2, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining based on the increased voice data samples.
  19. The storage medium according to claim 17, wherein step S3 is replaced by: calculating the spatial distance between the current voiceprint discrimination vector and each pre-stored standard voiceprint discrimination vector, obtaining the minimum spatial distance, authenticating the user based on the minimum spatial distance, and generating a verification result.
  20. The storage medium according to claim 17, wherein step S1 comprises:
    S11, performing pre-emphasis, framing, and windowing on the voice data;
    S12, performing a Fourier transform on each windowed frame to obtain a corresponding spectrum;
    S13, inputting the spectrum into a Mel filter bank to output a Mel spectrum;
    S14, performing cepstral analysis on the Mel spectrum to obtain Mel-frequency cepstral coefficients (MFCCs), and composing the corresponding voiceprint feature vector from the MFCCs.
PCT/CN2017/091361 2017-03-13 2017-06-30 Voiceprint recognition-based identity verification method, electronic device, and storage medium WO2018166112A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710147695.XA CN107068154A (en) 2017-03-13 2017-03-13 The method and system of authentication based on Application on Voiceprint Recognition
CN201710147695.X 2017-03-13

Publications (1)

Publication Number Publication Date
WO2018166112A1 true WO2018166112A1 (en) 2018-09-20

Family

ID=59622093

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/091361 WO2018166112A1 (en) 2017-03-13 2017-06-30 Voiceprint recognition-based identity verification method, electronic device, and storage medium
PCT/CN2017/105031 WO2018166187A1 (en) 2017-03-13 2017-09-30 Server, identity verification method and system, and a computer-readable storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/105031 WO2018166187A1 (en) 2017-03-13 2017-09-30 Server, identity verification method and system, and a computer-readable storage medium

Country Status (3)

Country Link
CN (2) CN107068154A (en)
TW (1) TWI641965B (en)
WO (2) WO2018166112A1 (en)

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition
CN107527620B (en) * 2017-07-25 2019-03-26 平安科技(深圳)有限公司 Electronic device, the method for authentication and computer readable storage medium
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN108172230A (en) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method, terminal installation and storage medium based on Application on Voiceprint Recognition model
CN108269575B (en) * 2018-01-12 2021-11-02 平安科技(深圳)有限公司 Voice recognition method for updating voiceprint data, terminal device and storage medium
CN108154371A (en) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic device, the method for authentication and storage medium
CN108091326B (en) * 2018-02-11 2021-08-06 张晓雷 Voiceprint recognition method and system based on linear regression
CN108694952B (en) * 2018-04-09 2020-04-28 平安科技(深圳)有限公司 Electronic device, identity authentication method and storage medium
CN108766444B (en) * 2018-04-09 2020-11-03 平安科技(深圳)有限公司 User identity authentication method, server and storage medium
CN108768654B (en) * 2018-04-09 2020-04-21 平安科技(深圳)有限公司 Identity verification method based on voiceprint recognition, server and storage medium
CN108447489B (en) * 2018-04-17 2020-05-22 清华大学 Continuous voiceprint authentication method and system with feedback
CN108806695A (en) * 2018-04-17 2018-11-13 平安科技(深圳)有限公司 Anti- fraud method, apparatus, computer equipment and the storage medium of self refresh
CN108630208B (en) * 2018-05-14 2020-10-27 平安科技(深圳)有限公司 Server, voiceprint-based identity authentication method and storage medium
CN108650266B (en) * 2018-05-14 2020-02-18 平安科技(深圳)有限公司 Server, voiceprint verification method and storage medium
CN108834138B (en) * 2018-05-25 2022-05-24 北京国联视讯信息技术股份有限公司 Network distribution method and system based on voiceprint data
CN109101801B (en) * 2018-07-12 2021-04-27 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for identity authentication
CN109087647B (en) * 2018-08-03 2023-06-13 平安科技(深圳)有限公司 Voiceprint recognition processing method and device, electronic equipment and storage medium
CN109256138B (en) * 2018-08-13 2023-07-07 平安科技(深圳)有限公司 Identity verification method, terminal device and computer readable storage medium
CN110867189A (en) * 2018-08-28 2020-03-06 北京京东尚科信息技术有限公司 Login method and device
CN110880325B (en) * 2018-09-05 2022-06-28 华为技术有限公司 Identity recognition method and equipment
CN109450850B (en) * 2018-09-26 2022-10-11 深圳壹账通智能科技有限公司 Identity authentication method, identity authentication device, computer equipment and storage medium
CN109377662A (en) * 2018-09-29 2019-02-22 途客易达(天津)网络科技有限公司 Charging pile control method, device and electronic equipment
CN109257362A (en) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice print verification
CN109378002B (en) * 2018-10-11 2024-05-07 平安科技(深圳)有限公司 Voiceprint verification method, voiceprint verification device, computer equipment and storage medium
CN109147797B (en) * 2018-10-18 2024-05-07 平安科技(深圳)有限公司 Customer service method, device, computer equipment and storage medium based on voiceprint recognition
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN109524026B (en) * 2018-10-26 2022-04-26 北京网众共创科技有限公司 Method and device for determining prompt tone, storage medium and electronic device
CN109360573A (en) * 2018-11-13 2019-02-19 平安科技(深圳)有限公司 Livestock method for recognizing sound-groove, device, terminal device and computer storage medium
CN109493873A (en) * 2018-11-13 2019-03-19 平安科技(深圳)有限公司 Livestock method for recognizing sound-groove, device, terminal device and computer storage medium
CN109636630A (en) * 2018-12-07 2019-04-16 泰康保险集团股份有限公司 Method, apparatus, medium and electronic equipment of the detection for behavior of insuring
CN110046910B (en) * 2018-12-13 2023-04-14 蚂蚁金服(杭州)网络技术有限公司 Method and equipment for judging validity of transaction performed by customer through electronic payment platform
CN109816508A (en) * 2018-12-14 2019-05-28 深圳壹账通智能科技有限公司 Method for authenticating user identity, device based on big data, computer equipment
CN109473108A (en) * 2018-12-15 2019-03-15 深圳壹账通智能科技有限公司 Auth method, device, equipment and storage medium based on Application on Voiceprint Recognition
CN109545226B (en) * 2019-01-04 2022-11-22 平安科技(深圳)有限公司 Voice recognition method, device and computer readable storage medium
CN110322888B (en) * 2019-05-21 2023-05-30 平安科技(深圳)有限公司 Credit card unlocking method, apparatus, device and computer readable storage medium
CN110298150B (en) * 2019-05-29 2021-11-26 上海拍拍贷金融信息服务有限公司 Identity verification method and system based on voice recognition
CN110334603A (en) * 2019-06-06 2019-10-15 视联动力信息技术股份有限公司 Authentication system
CN110473569A (en) * 2019-09-11 2019-11-19 苏州思必驰信息科技有限公司 Detect the optimization method and system of speaker's spoofing attack
CN110738998A (en) * 2019-09-11 2020-01-31 深圳壹账通智能科技有限公司 Voice-based personal credit evaluation method, device, terminal and storage medium
CN110971755B (en) * 2019-11-18 2021-04-20 武汉大学 Double-factor identity authentication method based on PIN code and pressure code
CN111402899B (en) * 2020-03-25 2023-10-13 中国工商银行股份有限公司 Cross-channel voiceprint recognition method and device
CN111597531A (en) * 2020-04-07 2020-08-28 北京捷通华声科技股份有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111625704A (en) * 2020-05-11 2020-09-04 镇江纵陌阡横信息科技有限公司 Non-personalized recommendation algorithm model based on user intention and data cooperation
CN111710340A (en) * 2020-06-05 2020-09-25 深圳市卡牛科技有限公司 Method, device, server and storage medium for identifying user identity based on voice
CN111613230A (en) * 2020-06-24 2020-09-01 泰康保险集团股份有限公司 Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN111899566A (en) * 2020-08-11 2020-11-06 南京畅淼科技有限责任公司 Ship traffic management system based on AIS
CN112289324B (en) * 2020-10-27 2024-05-10 湖南华威金安企业管理有限公司 Voiceprint identity recognition method and device and electronic equipment
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN112802481A (en) * 2021-04-06 2021-05-14 北京远鉴信息技术有限公司 Voiceprint verification method, voiceprint recognition model training method, device and equipment
CN113421575B (en) * 2021-06-30 2024-02-06 平安科技(深圳)有限公司 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN114780787A (en) * 2022-04-01 2022-07-22 杭州半云科技有限公司 Voiceprint retrieval method, identity verification method, identity registration method and device
CN114826709A (en) * 2022-04-15 2022-07-29 马上消费金融股份有限公司 Identity authentication and acoustic environment detection method, system, electronic device and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1403953A (en) * 2002-09-06 2003-03-19 浙江大学 Palm acoustic-print verifying system
CN102820033A (en) * 2012-08-17 2012-12-12 南京大学 Voiceprint identification method
US20120330663A1 (en) * 2011-06-27 2012-12-27 Hon Hai Precision Industry Co., Ltd. Identity authentication system and method
US20130225128A1 (en) * 2012-02-24 2013-08-29 Agnitio Sl System and method for speaker recognition on mobile devices

Family Cites Families (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645137B2 (en) * 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
TWI234762B (en) * 2003-12-22 2005-06-21 Top Dihital Co Ltd Voiceprint identification system for e-commerce
US7447633B2 (en) * 2004-11-22 2008-11-04 International Business Machines Corporation Method and apparatus for training a text independent speaker recognition system using speech data with text labels
US7536304B2 (en) * 2005-05-27 2009-05-19 Porticus, Inc. Method and system for bio-metric voice print authentication
CN101064043A (en) * 2006-04-29 2007-10-31 上海优浪信息科技有限公司 Sound-groove gate inhibition system and uses thereof
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN102238190B (en) * 2011-08-01 2013-12-11 安徽科大讯飞信息科技股份有限公司 Identity authentication method and system
CN102509547B (en) * 2011-12-29 2013-06-19 辽宁工业大学 Method and system for voiceprint recognition based on vector quantization based
CN102695112A (en) * 2012-06-09 2012-09-26 九江妙士酷实业有限公司 Automobile player and volume control method thereof
CN102916815A (en) * 2012-11-07 2013-02-06 华为终端有限公司 Method and device for checking identity of user
CN103220286B (en) * 2013-04-10 2015-02-25 郑方 Identity verification system and identity verification method based on dynamic password voice
CN104427076A (en) * 2013-08-30 2015-03-18 中兴通讯股份有限公司 Recognition method and recognition device for automatic answering of calling system
CN103632504A (en) * 2013-12-17 2014-03-12 上海电机学院 Silence reminder for library
CN104765996B (en) * 2014-01-06 2018-04-27 讯飞智元信息科技有限公司 Voiceprint password authentication method and system
CN104978507B (en) * 2014-04-14 2019-02-01 中国石油化工集团公司 A kind of Intelligent controller for logging evaluation expert system identity identifying method based on Application on Voiceprint Recognition
CN105100911A (en) * 2014-05-06 2015-11-25 夏普株式会社 Intelligent multimedia system and method
CN103986725A (en) * 2014-05-29 2014-08-13 中国农业银行股份有限公司 Client side, server side and identity authentication system and method
CN104157301A (en) * 2014-07-25 2014-11-19 广州三星通信技术研究有限公司 Method, device and terminal deleting voice information blank segment
CN105321293A (en) * 2014-09-18 2016-02-10 广东小天才科技有限公司 Danger detection and warning method and danger detection and warning smart device
CN104485102A (en) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device
CN104751845A (en) * 2015-03-31 2015-07-01 江苏久祥汽车电器集团有限公司 Voice recognition method and system used for intelligent robot
CN104992708B (en) * 2015-05-11 2018-07-24 国家计算机网络与信息安全管理中心 Specific audio detection model generation in short-term and detection method
CN105096955B (en) * 2015-09-06 2019-02-01 广东外语外贸大学 A kind of speaker's method for quickly identifying and system based on model growth cluster
CN105611461B (en) * 2016-01-04 2019-12-17 浙江宇视科技有限公司 Noise suppression method, device and system for front-end equipment voice application system
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN106971717A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Robot and audio recognition method, the device of webserver collaborative process
CN105869645B (en) * 2016-03-25 2019-04-12 腾讯科技(深圳)有限公司 Voice data processing method and device
CN106210323B (en) * 2016-07-13 2019-09-24 Oppo广东移动通信有限公司 A kind of speech playing method and terminal device
CN106169295B (en) * 2016-07-15 2019-03-01 腾讯科技(深圳)有限公司 Identity vector generation method and device
CN106373576B (en) * 2016-09-07 2020-07-21 Tcl科技集团股份有限公司 Speaker confirmation method and system based on VQ and SVM algorithms
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition

Also Published As

Publication number Publication date
WO2018166187A1 (en) 2018-09-20
TWI641965B (en) 2018-11-21
TW201833810A (en) 2018-09-16
CN107517207A (en) 2017-12-26
CN107068154A (en) 2017-08-18

Similar Documents

Publication Publication Date Title
WO2018166112A1 (en) Voiceprint recognition-based identity verification method, electronic device, and storage medium
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
JP6621536B2 (en) Electronic device, identity authentication method, system, and computer-readable storage medium
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
TWI527023B (en) A voiceprint recognition method and apparatus
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
WO2017215558A1 (en) Voiceprint recognition method and device
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
KR20180104595A (en) Method for identifying a gate, device, storage medium and backstage server
CN107886943A (en) A kind of method for recognizing sound-groove and device
CN103794207A (en) Dual-mode voice identity recognition method
CN105096955A (en) Speaker rapid identification method and system based on growing and clustering algorithm of models
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN102881291A (en) Sensing Hash value extracting method and sensing Hash value authenticating method for voice sensing Hash authentication
Duraibi Voice biometric identity authentication model for iot devices
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
CN113223536A (en) Voiceprint recognition method and device and terminal equipment
WO2019218512A1 (en) Server, voiceprint verification method, and storage medium
Biagetti et al. Speaker identification with short sequences of speech frames
Zhang et al. Imperceptible black-box waveform-level adversarial attack towards automatic speaker recognition
Lin et al. A multiscale chaotic feature extraction method for speaker recognition
Guo et al. Voice-based user-device physical unclonable functions for mobile device authentication
WO2019218515A1 (en) Server, voiceprint-based identity authentication method, and storage medium
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17900320

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09/12/2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17900320

Country of ref document: EP

Kind code of ref document: A1