WO2019100606A1 - Electronic device, voiceprint-based identity verification method, system and storage medium - Google Patents

Electronic device, voiceprint-based identity verification method, system and storage medium

Info

Publication number
WO2019100606A1
WO2019100606A1 (PCT/CN2018/076113, CN2018076113W)
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
voice data
preset
vector
data
Prior art date
Application number
PCT/CN2018/076113
Other languages
English (en)
French (fr)
Inventor
赵峰
王健宗
程宁
郑斯奇
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019100606A1 publication Critical patent/WO2019100606A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/38Payment protocols; Details thereof
    • G06Q20/40Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401Transaction verification
    • G06Q20/4014Identity check for transactions
    • G06Q20/40145Biometric identity checks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/08Use of distortion metrics or a particular distance between probe pattern and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of communications technologies, and in particular, to an electronic device, a voiceprint-based identity verification method, a system, and a storage medium.
  • the purpose of the present application is to provide an electronic device, a voiceprint-based identity verification method, system, and storage medium, which are intended to improve the accuracy and efficiency of identity verification.
  • the present application provides an electronic device including a memory and a processor coupled to the memory, the memory storing a processing system operable on the processor, the processing system implementing the following steps when executed by the processor:
  • a framing and sampling step: after receiving the voice data of the target user to be authenticated, calling a predetermined convolutional neural network (CNN) model to frame and sample the voice data to obtain voice sample data; an extracting step: processing the voice sample data with a preset filter to extract a preset type of voiceprint feature, and constructing a voiceprint feature vector corresponding to the voice data based on the preset type of voiceprint feature; a constructing step: inputting the voiceprint feature vector into a pre-trained background channel model to construct a current voiceprint discrimination vector of the voice data; and a verification step: calculating the spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, authenticating the user based on the spatial distance, and generating a verification result.
  • the present application further provides a voiceprint-based identity verification method, where the voiceprint-based identity verification method includes:
  • S1: after receiving the voice data of the target user to be authenticated, calling a predetermined convolutional neural network (CNN) model to frame and sample the voice data to obtain voice sample data; S2: processing the voice sample data with a preset filter to extract a preset type of voiceprint feature, and constructing a voiceprint feature vector corresponding to the voice data based on the preset type of voiceprint feature; S3: inputting the voiceprint feature vector into a pre-trained background channel model to construct a current voiceprint discrimination vector of the voice data; S4: calculating the spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, authenticating the user based on the spatial distance, and generating a verification result.
  • the present application further provides a voiceprint-based identity verification system, where the voiceprint-based identity verification system includes:
  • a framing and sampling module, configured to, after receiving the voice data of the target user to be authenticated, call a predetermined convolutional neural network (CNN) model to frame and sample the voice data to obtain voice sample data;
  • an extracting module, configured to process the voice sample data with a preset filter to extract a preset type of voiceprint feature, and construct a voiceprint feature vector corresponding to the voice data based on the preset type of voiceprint feature;
  • a constructing module, configured to input the voiceprint feature vector into a pre-trained background channel model to construct a current voiceprint discrimination vector of the voice data; and
  • a verification module, configured to calculate the spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, authenticate the user based on the spatial distance, and generate a verification result.
  • the present application also provides a computer readable storage medium having a processing system stored thereon, the processing system, when executed by a processor, implementing the following steps:
  • a framing and sampling step: after receiving the voice data of the target user to be authenticated, calling a predetermined convolutional neural network (CNN) model to frame and sample the voice data to obtain voice sample data; an extracting step: processing the voice sample data with a preset filter to extract a preset type of voiceprint feature, and constructing a voiceprint feature vector corresponding to the voice data based on the preset type of voiceprint feature; a constructing step: inputting the voiceprint feature vector into a pre-trained background channel model to construct a current voiceprint discrimination vector of the voice data; and a verification step: calculating the spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, authenticating the user based on the spatial distance, and generating a verification result.
  • The beneficial effects of the present application are as follows: when authenticating the target user based on the voiceprint, the present application uses a convolutional neural network model to frame and sample the voice data, which makes it possible to obtain the useful local data in the voice data quickly and effectively; extracting voiceprint features from the voice sample data and constructing a voiceprint feature vector for identity verification of the target user can improve the accuracy and efficiency of identity verification.
  • FIG. 1 is a schematic diagram of a hardware architecture of an embodiment of an electronic device according to the present application.
  • FIG. 2 is a schematic flowchart diagram of an embodiment of a voiceprint-based identity verification method according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic diagram of a hardware architecture of an embodiment of an electronic device according to the present application.
  • the electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing in accordance with instructions that have been set or stored in advance.
  • the electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • the electronic device 1 may include, but is not limited to, a memory 11 communicably connected to each other through a system bus, a processor 12, and a network interface 13, and the memory 11 stores a processing system operable on the processor 12. It should be noted that FIG. 1 only shows the electronic device 1 having the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • the memory 11 includes a memory and at least one type of readable storage medium.
  • the memory provides a cache for the operation of the electronic device 1;
  • the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a programmable read only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, or the like.
  • In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • the readable storage medium of the memory 11 is generally used to store an operating system installed in the electronic device 1 and various types of application software, such as program codes of the processing system in an embodiment of the present application. Further, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing associated with data interaction or communication with the other devices.
  • the processor 12 is configured to run program code or process data stored in the memory 11, such as running a processing system or the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the network interface 13 is mainly used to connect the electronic device 1 with other devices, establish a data transmission channel and a communication connection, and receive voice data of the target user to be authenticated.
  • the processing system is stored in the memory 11 and includes at least one computer readable instruction stored in the memory 11, the at least one computer readable instruction being executable by the processor 12 to implement the methods of various embodiments of the present application;
  • the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.
  • the framing sampling step after receiving the voice data of the target user to be authenticated, calling a predetermined convolutional neural network CNN model to frame and sample the voice data to obtain voice sample data;
  • In this embodiment, the voice data is collected by a voice collection device (for example, a microphone). When collecting voice data, environmental noise and interference from the voice collection device itself should be avoided as far as possible: the voice collection device should be kept at an appropriate distance from the target user, a voice collection device with large distortion should be avoided, mains power is preferably used with the current kept stable, and a sensor should be used when recording over the telephone.
  • the speech data can be denoised before framing and sampling to further reduce interference.
  • the collected voice data is voice data of a preset data length, or voice data greater than a preset data length.
  • In a preferred embodiment, the received voice data is one-dimensional voice data, and the framing and sampling step specifically includes:
  • Framing the voice data, and arranging the framed voice data with frames as rows and in-frame data as columns to obtain two-dimensional voice data corresponding to the voice data; convolving the two-dimensional voice data with a convolution kernel of a preset specification based on a first preset step size; and performing max-pooling (maxpooling) sampling on the convolved voice data according to a second preset step size to obtain the voice sample data.
  • A speech signal is stationary only over a short time, so framing divides a segment of speech into N short-time speech signals; to avoid losing the continuity characteristics of the speech signal, adjacent speech frames overlap, and the overlapping region is generally 1/2 of the frame length. After framing, each frame is treated as a stationary signal.
  • the convolution kernel of the preset specification may be a 5*5 convolution kernel, the first preset step size may be 1*1, and the second preset step size may be 2*2.
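  • As a rough illustration of this framing-plus-CNN-sampling step, the sketch below frames a 1-D waveform into a 2-D matrix (frames as rows, in-frame samples as columns), applies a single 5*5 convolution with a 1*1 stride and then 2*2 max-pooling with a 2*2 stride. The frame length, overlap and PyTorch-based implementation are illustrative assumptions, not details fixed by this application.

```python
# Minimal sketch of the framing-and-sampling step, assuming a 1-D waveform,
# a 400-sample frame with 50% overlap, one 5x5 kernel (stride 1x1) and
# 2x2 max-pooling (stride 2x2). Names like frame_signal() are illustrative.
import numpy as np
import torch
import torch.nn as nn

def frame_signal(waveform: np.ndarray, frame_len: int = 400) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames (rows = frames, columns = in-frame samples)."""
    hop = frame_len // 2                       # adjacent frames overlap by half a frame length
    n_frames = 1 + (len(waveform) - frame_len) // hop
    return np.stack([waveform[i * hop: i * hop + frame_len] for i in range(n_frames)])

waveform = np.random.randn(16000).astype(np.float32)   # stand-in for 1 s of 16 kHz speech
frames = frame_signal(waveform)                         # 2-D "image": (n_frames, frame_len)

x = torch.from_numpy(frames)[None, None]                # shape (1, 1, n_frames, frame_len)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, stride=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
sampled = pool(conv(x))                                 # the "voice sample data" passed on to feature extraction
print(sampled.shape)
```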
  • the voiceprint feature includes a plurality of types, such as a wide-band voiceprint, a narrow-band voiceprint, an amplitude voiceprint, etc.
  • the preset type voiceprint feature is preferably a Mel Frequency Cepstrum Coefficient (MFCC) of voice sample data.
  • the preset filter is a Mel filter. When constructing the corresponding voiceprint feature vector, the voiceprint features of the voice sample data are assembled into a feature data matrix, and this feature data matrix is the voiceprint feature vector of the voice sample data.
  • the background channel model is preferably a Gaussian mixture model, and the Gaussian mixture model is used to calculate the voiceprint feature vector to obtain a corresponding current voiceprint discrimination vector (ie, i-vector).
  • Specifically, the calculation process includes:
  • 1) Selecting Gaussian components: first, the parameters of the universal background channel model are used to compute, for each frame of data, the log-likelihood under each Gaussian component; by sorting each column of the log-likelihood matrix in parallel, the top N Gaussian components are selected, and a matrix of per-frame values under the Gaussian mixture model is finally obtained: Loglike = E(X)*D(X)^{-1}*X^T - 0.5*D(X)^{-1}*(X.^2)^T, where Loglike is the log-likelihood matrix, E(X) is the mean matrix trained by the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 denotes squaring each value of the matrix. The per-row log-likelihood is computed as loglikes_i = C_i + E_i*Cov_i^{-1}*X_i - X_i^T*X_i*Cov_i^{-1}, where loglikes_i is the i-th row vector of the log-likelihood matrix, C_i is the constant term of the i-th component, E_i is the mean matrix of the i-th component, Cov_i is the covariance matrix of the i-th component, and X_i is the i-th frame of data.
  • 2) Computing posterior probabilities: each frame of data X is expanded by computing X*X^T, kept as a lower-triangular matrix and flattened into a row, and the resulting vectors of all frames are combined into a new data matrix; the covariance matrices used for probability computation in the universal background model are likewise reduced to lower-triangular form. The log-likelihood of each frame under the selected Gaussian components is then computed from the mean and covariance matrices of the universal background channel model, Softmax regression is applied, and a final normalization yields the posterior probability distribution of each frame over the mixture components; the per-frame probability distribution vectors are assembled into a probability matrix.
  • 3) Extracting the current voiceprint discrimination vector: the first-order and second-order coefficients are computed first. The first-order coefficients are obtained by summing the columns of the probability matrix: Gamma_i = Σ_j loglikes_{ji}, where Gamma_i is the i-th element of the first-order coefficient vector and loglikes_{ji} is the element in the j-th row and i-th column of the likelihood matrix. The second-order coefficients are obtained by multiplying the transpose of the probability matrix by the data matrix: X = Loglike^T * feats, where X is the second-order coefficient matrix, Loglike is the likelihood matrix, and feats is the feature data matrix. After the first-order and second-order coefficients are obtained, the primary term and the quadratic term are computed in parallel, and the current voiceprint discrimination vector is then computed from the primary and quadratic terms.
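  • The sketch below illustrates, under simplifying assumptions, how such statistics can be gathered from a diagonal-covariance background model: per-frame log-likelihoods are computed for every Gaussian component, normalized into a posterior (probability) matrix, and the column sums and the product of the transposed posterior matrix with the feature matrix give the first-order and second-order coefficients referred to above (the zeroth- and first-order Baum-Welch statistics in the usual i-vector literature). The variable names are illustrative and this is not the exact computation used by this application.

```python
# Sketch of gathering background-model statistics, assuming a diagonal-covariance
# GMM with K components over D-dimensional feature frames. Names are illustrative.
import numpy as np
from scipy.special import logsumexp

def gmm_stats(feats, weights, means, variances):
    """feats: (T, D) frames; weights: (K,); means, variances: (K, D)."""
    T, D = feats.shape
    # per-frame, per-component Gaussian log-likelihoods -> (T, K)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    diff = feats[:, None, :] - means[None, :, :]                      # (T, K, D)
    loglike = log_norm + np.log(weights) - 0.5 * (diff ** 2 / variances).sum(axis=2)
    # posterior "probability matrix", one row per frame
    post = np.exp(loglike - logsumexp(loglike, axis=1, keepdims=True))
    gamma = post.sum(axis=0)      # "first-order coefficients": column sums of the probability matrix
    fstats = post.T @ feats       # "second-order coefficients": transposed probability matrix times data matrix
    return gamma, fstats

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 13))                  # 200 frames of 13-dim features
weights = np.full(8, 1 / 8)
means = rng.standard_normal((8, 13))
variances = np.ones((8, 13))
gamma, fstats = gmm_stats(feats, weights, means, variances)
print(gamma.shape, fstats.shape)                        # (8,) (8, 13)
```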
  • Preferably, the process of training the Gaussian mixture model includes: obtaining a preset number (for example, one hundred thousand) of voice data samples, processing the voice data samples to obtain preset type voiceprint features, and constructing corresponding voiceprint feature vectors based on the voiceprint features of each voice data sample;
  • dividing the voiceprint feature vectors into a training set of a first ratio (for example, 0.75) and a verification set of a second ratio (for example, 0.25), the sum of the first ratio and the second ratio being less than or equal to 1;
  • training the Gaussian mixture model with the voiceprint feature vectors in the training set, and, after the training is completed, verifying the accuracy of the trained Gaussian mixture model with the verification set;
  • if the accuracy is greater than a preset threshold, ending the model training and using the trained Gaussian mixture model as the background channel model, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and training again based on the increased voice data samples.
  • When the Gaussian mixture model is trained with the voiceprint feature vectors in the training set, the likelihood of an extracted D-dimensional voiceprint feature can be expressed with K Gaussian components as: P(x) = Σ_{k=1}^{K} w_k * p(x|k), where P(x) is the probability that a voice data sample is generated by the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability that the sample is generated by the k-th Gaussian component, and K is the number of Gaussian components.
  • The parameters of the whole Gaussian mixture model can be expressed as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i is the mean of the i-th Gaussian component, and Σ_i is the covariance of the i-th Gaussian component. The Gaussian mixture model can be trained with the unsupervised EM algorithm, and the objective function uses maximum likelihood estimation, i.e. the log-likelihood function is maximized by choosing the parameters. After training is completed, the weight vector, constant vector, N covariance matrices, and mean-times-covariance matrices of the Gaussian mixture model are obtained, which together constitute a trained Gaussian mixture model.
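  • A rough sketch of this training loop is shown below, using scikit-learn's EM-based GaussianMixture as a stand-in for the background channel model. The 0.75/0.25 split follows the text; because the application does not spell out its accuracy metric, the held-out average log-likelihood is used here as a proxy, and the component count and threshold are illustrative assumptions.

```python
# Sketch of the background-model training loop with an EM-trained GMM.
# The validation "accuracy" is approximated by held-out average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def train_background_model(features, n_components=64, threshold=-40.0):
    """features: (N, D) voiceprint feature vectors pooled from many speakers."""
    train, val = train_test_split(features, test_size=0.25, random_state=0)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag", max_iter=200)
    gmm.fit(train)                                   # unsupervised EM, maximum-likelihood objective
    val_score = gmm.score(val)                       # mean per-sample log-likelihood on the validation set
    if val_score > threshold:
        return gmm                                   # good enough: use as the background channel model
    raise RuntimeError("validation score too low; collect more voice samples and retrain")

features = np.random.randn(5000, 13)                 # stand-in for pooled voiceprint features
ubm = train_background_model(features, n_components=8, threshold=-25.0)
print(ubm.weights_.shape, ubm.means_.shape, ubm.covariances_.shape)
```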
  • the background channel model pre-trained in this embodiment is obtained by mining and comparing a large amount of voice data.
  • This model can accurately depict the background voiceprint characteristics of the user while maximally retaining the voiceprint features of the user. And this feature can be removed at the time of identification, and the inherent characteristics of the user's voice can be extracted, which can greatly improve the accuracy and efficiency of user identity verification.
  • the verification step calculates a spatial distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user, performs identity verification on the user based on the spatial distance, and generates a verification result.
  • In this embodiment, several distances between vectors are possible, including the cosine distance and the Euclidean distance; preferably, the spatial distance in this embodiment is the cosine distance, which uses the cosine of the angle between two vectors in the vector space as a measure of the magnitude of the difference between two individuals.
  • the standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance, and the standard voiceprint discriminant vector carries the identifier information of the corresponding user when stored, which can accurately represent the identity of the corresponding user.
  • the stored standard voiceprint discrimination vector is obtained based on the identification information provided by the user before calculating the spatial distance.
  • If the calculated spatial distance is less than or equal to a preset distance threshold, the verification passes; otherwise, the verification fails.
  • Compared with the prior art, when authenticating the target user based on the voiceprint, this embodiment uses a convolutional neural network model to frame and sample the voice data, which makes it possible to obtain the useful local data in the voice data quickly and effectively; extracting voiceprint features from the voice sample data and constructing a voiceprint feature vector for identity verification of the target user can improve the accuracy and efficiency of identity verification. In addition, this embodiment makes full use of the voiceprint features in the human voice that are related to the vocal tract; such voiceprint features do not require any restriction on the text, which provides greater flexibility in the process of recognition and verification.
  • In a preferred embodiment, the extracting step specifically includes:
  • performing pre-emphasis and windowing on the voice sample data, performing a Fourier transform on each window to obtain the corresponding spectrum, and inputting the spectrum into a Mel filter to output a Mel spectrum;
  • performing cepstrum analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients MFCC, and forming the corresponding voiceprint feature vector based on the Mel frequency cepstral coefficients MFCC.
  • The pre-emphasis processing is in fact a high-pass filtering process that filters out low-frequency data so that the high-frequency characteristics of the voice data stand out; specifically, the transfer function of the high-pass filter is H(z) = 1 - α*z^{-1}, where z is the voice data and α is a constant coefficient, preferably 0.97. Because the voice sample data deviates from the original speech to some extent after framing, windowing of the voice sample data is required.
  • The cepstrum analysis on the Mel spectrum consists, for example, of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented by a DCT (discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the Mel frequency cepstral coefficients MFCC. The MFCCs are the voiceprint features of each frame of voice sample data, and the MFCCs of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice sample data.
  • This embodiment uses the MFCCs of the voice sample data to form the corresponding voiceprint feature vector; because the Mel-spaced frequency bands approximate the human auditory system more closely than the linearly spaced frequency bands used in a normal log cepstrum, this can improve the accuracy of identity verification.
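  • A condensed sketch of this MFCC pipeline (pre-emphasis, windowing, FFT, Mel filtering, logarithm, DCT, coefficients 2-13) is given below. The 0.97 pre-emphasis coefficient and the 2nd-13th DCT coefficients follow the text; the sampling rate, frame size and use of librosa's Mel filter bank are illustrative assumptions.

```python
# Condensed MFCC pipeline sketch: pre-emphasis, Hamming window, FFT,
# Mel filter bank, log, DCT, keep coefficients 2-13.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_matrix(frames: np.ndarray, sr: int = 16000, n_mels: int = 26) -> np.ndarray:
    """frames: (n_frames, frame_len) framed speech -> (n_frames, 12) MFCC feature matrix."""
    emphasized = np.concatenate(
        [frames[:, :1], frames[:, 1:] - 0.97 * frames[:, :-1]], axis=1)   # H(z) = 1 - 0.97 z^-1
    windowed = emphasized * np.hamming(frames.shape[1])
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2            # per-frame power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)        # (n_mels, 1 + n_fft // 2)
    log_mel = np.log(power @ mel_fb.T + 1e-10)                             # Mel spectrum, then log
    cepstra = dct(log_mel, type=2, axis=1, norm="ortho")                   # cepstral analysis via DCT
    return cepstra[:, 1:13]                                                # 2nd..13th coefficients = MFCC

frames = np.random.randn(79, 400)                 # stand-in for framed voice sample data
feats = mfcc_matrix(frames)                       # feature data matrix = voiceprint feature vector
print(feats.shape)                                # (79, 12)
```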
  • the verifying step specifically includes:
  • Calculating the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user (based on the cosine of the angle between the standard voiceprint discrimination vector and the current voiceprint discrimination vector); if the cosine distance is less than or equal to the preset distance threshold, generating information indicating that the verification passes; if the cosine distance is greater than the preset distance threshold, generating information indicating that the verification fails.
  • When the standard voiceprint discrimination vector of the target user is stored, it may carry the identifier information of the target user; when the identity of the user is verified, the corresponding standard voiceprint discrimination vector is obtained by matching the identifier information associated with the current voiceprint discrimination vector.
  • the cosine distance between the current voiceprint discrimination vector and the matched standard voiceprint discrimination vector is calculated, and the cosine distance is used to verify the identity of the target user, thereby improving the accuracy of the identity verification.
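  • A minimal sketch of this verification step is shown below. The cosine distance is taken here as 1 minus the cosine similarity between the current and the standard voiceprint discrimination vectors, so that a smaller distance means a closer match, consistent with the pass condition above; the threshold value and vector dimensionality are illustrative assumptions.

```python
# Minimal verification-step sketch: cosine distance against an enrolled vector.
import numpy as np

def verify(current_ivec: np.ndarray, standard_ivec: np.ndarray, threshold: float = 0.4) -> bool:
    """Return True (verification passes) when the cosine distance is within the preset threshold."""
    cos_sim = np.dot(current_ivec, standard_ivec) / (
        np.linalg.norm(current_ivec) * np.linalg.norm(standard_ivec))
    cosine_distance = 1.0 - cos_sim      # assumed definition; the text only fixes the <= threshold rule
    return cosine_distance <= threshold

# standard voiceprint discrimination vectors stored per user identifier (toy data)
enrolled = {"user_42": np.random.randn(400)}
probe = enrolled["user_42"] + 0.1 * np.random.randn(400)   # a new utterance from the same speaker
print(verify(probe, enrolled["user_42"]))                   # True in this toy case
```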
  • FIG. 2 is a schematic flowchart of an embodiment of a voiceprint-based identity verification method according to an embodiment of the present disclosure.
  • the voiceprint-based identity verification method includes the following steps:
  • Step S1 after receiving the voice data of the target user to be authenticated, calling a predetermined convolutional neural network CNN model to frame and sample the voice data to obtain voice sample data;
  • In this embodiment, the voice data is collected by a voice collection device (for example, a microphone). When collecting voice data, environmental noise and interference from the voice collection device itself should be avoided as far as possible: the voice collection device should be kept at an appropriate distance from the target user, a voice collection device with large distortion should be avoided, mains power is preferably used with the current kept stable, and a sensor should be used when recording over the telephone.
  • the speech data can be denoised before framing and sampling to further reduce interference.
  • the collected voice data is voice data of a preset data length, or voice data greater than a preset data length.
  • In a preferred embodiment, the received voice data is one-dimensional voice data, and the framing and sampling step specifically includes:
  • Framing the voice data, and arranging the framed voice data with frames as rows and in-frame data as columns to obtain two-dimensional voice data corresponding to the voice data; convolving the two-dimensional voice data with a convolution kernel of a preset specification based on a first preset step size; and performing max-pooling (maxpooling) sampling on the convolved voice data according to a second preset step size to obtain the voice sample data.
  • A speech signal is stationary only over a short time, so framing divides a segment of speech into N short-time speech signals; to avoid losing the continuity characteristics of the speech signal, adjacent speech frames overlap, and the overlapping region is generally 1/2 of the frame length. After framing, each frame is treated as a stationary signal.
  • the convolution kernel of the preset specification may be a 5*5 convolution kernel, the first preset step size may be 1*1, and the second preset step size may be 2*2.
  • Step S2 processing the voice sample data by using a preset filter to extract a preset type voiceprint feature, and constructing a voiceprint feature vector corresponding to the voice data based on the preset type voiceprint feature;
  • the voiceprint feature includes a plurality of types, such as a wide-band voiceprint, a narrow-band voiceprint, an amplitude voiceprint, etc.
  • the preset type voiceprint feature is preferably a Mel Frequency Cepstrum Coefficient (MFCC) of voice sample data.
  • the preset filter is a Mel filter. When constructing the corresponding voiceprint feature vector, the voiceprint features of the voice sample data are assembled into a feature data matrix, and this feature data matrix is the voiceprint feature vector of the voice sample data.
  • Step S3 inputting the voiceprint feature vector into the pre-trained background channel model to construct a current voiceprint discrimination vector of the voice data
  • the background channel model is preferably a Gaussian mixture model, and the Gaussian mixture model is used to calculate the voiceprint feature vector to obtain a corresponding current voiceprint discrimination vector (ie, i-vector).
  • Specifically, the calculation process includes:
  • 1) Selecting Gaussian components: first, the parameters of the universal background channel model are used to compute, for each frame of data, the log-likelihood under each Gaussian component; by sorting each column of the log-likelihood matrix in parallel, the top N Gaussian components are selected, and a matrix of per-frame values under the Gaussian mixture model is finally obtained: Loglike = E(X)*D(X)^{-1}*X^T - 0.5*D(X)^{-1}*(X.^2)^T, where Loglike is the log-likelihood matrix, E(X) is the mean matrix trained by the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 denotes squaring each value of the matrix. The per-row log-likelihood is computed as loglikes_i = C_i + E_i*Cov_i^{-1}*X_i - X_i^T*X_i*Cov_i^{-1}, where loglikes_i is the i-th row vector of the log-likelihood matrix, C_i is the constant term of the i-th component, E_i is the mean matrix of the i-th component, Cov_i is the covariance matrix of the i-th component, and X_i is the i-th frame of data.
  • 2) Computing posterior probabilities: each frame of data X is expanded by computing X*X^T, kept as a lower-triangular matrix and flattened into a row, and the resulting vectors of all frames are combined into a new data matrix; the covariance matrices used for probability computation in the universal background model are likewise reduced to lower-triangular form. The log-likelihood of each frame under the selected Gaussian components is then computed from the mean and covariance matrices of the universal background channel model, Softmax regression is applied, and a final normalization yields the posterior probability distribution of each frame over the mixture components; the per-frame probability distribution vectors are assembled into a probability matrix.
  • 3) Extracting the current voiceprint discrimination vector: the first-order and second-order coefficients are computed first. The first-order coefficients are obtained by summing the columns of the probability matrix: Gamma_i = Σ_j loglikes_{ji}, where Gamma_i is the i-th element of the first-order coefficient vector and loglikes_{ji} is the element in the j-th row and i-th column of the likelihood matrix. The second-order coefficients are obtained by multiplying the transpose of the probability matrix by the data matrix: X = Loglike^T * feats, where X is the second-order coefficient matrix, Loglike is the likelihood matrix, and feats is the feature data matrix. After the first-order and second-order coefficients are obtained, the primary term and the quadratic term are computed in parallel, and the current voiceprint discrimination vector is then computed from the primary and quadratic terms.
  • Preferably, the process of training the Gaussian mixture model includes: obtaining a preset number (for example, one hundred thousand) of voice data samples, processing the voice data samples to obtain preset type voiceprint features, and constructing corresponding voiceprint feature vectors based on the voiceprint features of each voice data sample;
  • dividing the voiceprint feature vectors into a training set of a first ratio (for example, 0.75) and a verification set of a second ratio (for example, 0.25), the sum of the first ratio and the second ratio being less than or equal to 1;
  • training the Gaussian mixture model with the voiceprint feature vectors in the training set, and, after the training is completed, verifying the accuracy of the trained Gaussian mixture model with the verification set;
  • if the accuracy is greater than a preset threshold, ending the model training and using the trained Gaussian mixture model as the background channel model, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and training again based on the increased voice data samples.
  • When the Gaussian mixture model is trained with the voiceprint feature vectors in the training set, the likelihood of an extracted D-dimensional voiceprint feature can be expressed with K Gaussian components as: P(x) = Σ_{k=1}^{K} w_k * p(x|k), where P(x) is the probability that a voice data sample is generated by the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability that the sample is generated by the k-th Gaussian component, and K is the number of Gaussian components.
  • The parameters of the whole Gaussian mixture model can be expressed as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i is the mean of the i-th Gaussian component, and Σ_i is the covariance of the i-th Gaussian component. The Gaussian mixture model can be trained with the unsupervised EM algorithm, and the objective function uses maximum likelihood estimation, i.e. the log-likelihood function is maximized by choosing the parameters. After training is completed, the weight vector, constant vector, N covariance matrices, and mean-times-covariance matrices of the Gaussian mixture model are obtained, which together constitute a trained Gaussian mixture model.
  • the background channel model pre-trained in this embodiment is obtained by mining and comparing a large amount of voice data.
  • This model can accurately depict the background voiceprint characteristics of the user while maximally retaining the voiceprint features of the user. And this feature can be removed at the time of identification, and the inherent characteristics of the user's voice can be extracted, which can greatly improve the accuracy and efficiency of user identity verification.
  • Step S4 Calculate a spatial distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user, perform identity verification on the user based on the spatial distance, and generate a verification result.
  • In this embodiment, several distances between vectors are possible, including the cosine distance and the Euclidean distance; preferably, the spatial distance in this embodiment is the cosine distance, which uses the cosine of the angle between two vectors in the vector space as a measure of the magnitude of the difference between two individuals.
  • the standard voiceprint discriminant vector is a voiceprint discriminant vector obtained and stored in advance, and the standard voiceprint discriminant vector carries the identifier information of the corresponding user when stored, which can accurately represent the identity of the corresponding user.
  • the stored standard voiceprint discrimination vector is obtained based on the identification information provided by the user before calculating the spatial distance.
  • If the calculated spatial distance is less than or equal to a preset distance threshold, the verification passes; otherwise, the verification fails.
  • Compared with the prior art, when authenticating the target user based on the voiceprint, this embodiment uses a convolutional neural network model to frame and sample the voice data, which makes it possible to obtain the useful local data in the voice data quickly and effectively; extracting voiceprint features from the voice sample data and constructing a voiceprint feature vector for identity verification of the target user can improve the accuracy and efficiency of identity verification. In addition, this embodiment makes full use of the voiceprint features in the human voice that are related to the vocal tract; such voiceprint features do not require any restriction on the text, which provides greater flexibility in the process of recognition and verification.
  • In a preferred embodiment, the foregoing step S2 specifically includes:
  • performing pre-emphasis and windowing on the voice sample data, performing a Fourier transform on each window to obtain the corresponding spectrum, and inputting the spectrum into a Mel filter to output a Mel spectrum;
  • performing cepstrum analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients MFCC, and forming the corresponding voiceprint feature vector based on the Mel frequency cepstral coefficients MFCC.
  • The pre-emphasis processing is in fact a high-pass filtering process that filters out low-frequency data so that the high-frequency characteristics of the voice data stand out; specifically, the transfer function of the high-pass filter is H(z) = 1 - α*z^{-1}, where z is the voice data and α is a constant coefficient, preferably 0.97. Because the voice sample data deviates from the original speech to some extent after framing, windowing of the voice sample data is required.
  • The cepstrum analysis on the Mel spectrum consists, for example, of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented by a DCT (discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the Mel frequency cepstral coefficients MFCC. The MFCCs are the voiceprint features of each frame of voice sample data, and the MFCCs of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice sample data.
  • This embodiment uses the MFCCs of the voice sample data to form the corresponding voiceprint feature vector; because the Mel-spaced frequency bands approximate the human auditory system more closely than the linearly spaced frequency bands used in a normal log cepstrum, this can improve the accuracy of identity verification.
  • the step S4 specifically includes:
  • Calculating the cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the target user (based on the cosine of the angle between the standard voiceprint discrimination vector and the current voiceprint discrimination vector); if the cosine distance is less than or equal to the preset distance threshold, generating information indicating that the verification passes; if the cosine distance is greater than the preset distance threshold, generating information indicating that the verification fails.
  • When the standard voiceprint discrimination vector of the target user is stored, it may carry the identifier information of the target user; when the identity of the user is verified, the corresponding standard voiceprint discrimination vector is obtained by matching the identifier information associated with the current voiceprint discrimination vector.
  • the cosine distance between the current voiceprint discrimination vector and the matched standard voiceprint discrimination vector is calculated, and the cosine distance is used to verify the identity of the target user, thereby improving the accuracy of the identity verification.
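  • To show how steps S1-S4 compose end to end, the high-level sketch below chains toy stand-ins for each stage (framing/CNN sampling, MFCC extraction, background-channel projection, cosine scoring). Every helper here is a deliberately simplified placeholder for the corresponding step sketched earlier, not the method defined by this application.

```python
# High-level composition of S1-S4 with deliberately simplified placeholder stages.
import numpy as np

def frame_and_sample(waveform):                   # S1: CNN framing + max-pooling (see earlier sketch)
    return waveform.reshape(-1, 400)

def extract_voiceprint_features(samples):         # S2: Mel filtering -> MFCC feature matrix
    return samples[:, :13]

def to_ivector(feature_matrix):                   # S3: background channel model -> voiceprint discrimination vector
    return feature_matrix.mean(axis=0)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def authenticate(waveform, standard_ivector, threshold=0.4):   # S4: compare against the enrolled vector
    feats = extract_voiceprint_features(frame_and_sample(waveform))
    current = to_ivector(feats)
    return cosine_distance(current, standard_ivector) <= threshold

waveform = np.random.randn(16000)
standard = to_ivector(extract_voiceprint_features(frame_and_sample(waveform)))   # toy enrollment
print(authenticate(waveform, standard))           # True: the probe matches its own enrollment
```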
  • the present application also provides a computer readable storage medium having stored thereon a processing system that, when executed by a processor, implements the steps of the voiceprint based authentication method described above.
  • Through the description of the foregoing embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Abstract

An electronic device, a voiceprint-based identity verification method and system, and a storage medium. The method includes: after receiving voice data of a target user to be authenticated, calling a predetermined convolutional neural network (CNN) model to frame and sample the voice data to obtain voice sample data (S1); processing the voice sample data with a preset filter to extract a preset type of voiceprint feature, and constructing a voiceprint feature vector corresponding to the voice data based on the preset type of voiceprint feature (S2); inputting the voiceprint feature vector into a pre-trained background channel model to construct a current voiceprint discrimination vector of the voice data (S3); and calculating the spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the target user, authenticating the user based on the spatial distance, and generating a verification result (S4). The method can improve the accuracy and efficiency of identity verification.

Description

Electronic device, voiceprint-based identity verification method, system and storage medium
Priority claim
Under the Paris Convention, this application claims priority to the Chinese patent application No. CN201711161344.0, filed on 21 November 2017 and entitled "Electronic device, voiceprint-based identity verification method and storage medium", the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the field of communications technologies, and in particular to an electronic device, a voiceprint-based identity verification method, a system, and a storage medium.
Background
At present, the business scope of many large financial companies covers insurance, banking, investment and other areas, and each of these areas usually requires communication with customers; verifying the customer's identity has therefore become an important part of ensuring business security. To meet real-time business requirements, such financial companies currently verify customer identities mainly by manual analysis. Because the customer base is huge, relying solely on manual judgment is not only time-consuming, labor-intensive and error-prone, but also greatly increases business costs. In addition, some financial companies have tried to authenticate users automatically by means of automatic speech recognition; however, the accuracy of such existing automatic speech recognition approaches is low and needs improvement. How to provide a highly accurate automatic voice recognition solution has therefore become a technical problem to be solved urgently.
发明内容
本申请的目的在于提供一种电子装置、基于声纹的身份验证方法、系统及存储介质,旨在提高身份验证的准确性及效率。
为实现上述目的,本申请提供一种电子装置,所述电子装置包括存储器及与所述存储器连接的处理器,所述存储器中存储有可在所述处理器上运行的处理系统,所述处理系统被所述处理器执行时实现如下步骤:
分帧采样步骤,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;提取步骤,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;构建步骤,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;验证步骤,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
为实现上述目的,本申请还提供一种基于声纹的身份验证方法,所述基于声纹的身份验证方法包括:
S1,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;S2,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;S3,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;S4,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
为实现上述目的,本申请还提供一种基于声纹的身份验证系统,所述基于声纹的身份验证系统包括:
分帧采样模块,用于在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;提取模块,用于利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;构建模块,用于将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;验证模块,用于计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存 储有处理系统,所述处理系统被处理器执行时实现步骤:
分帧采样步骤,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;提取步骤,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;构建步骤,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;验证步骤,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
本申请的有益效果是:本申请在基于声纹对目标用户进行身份验证时,采用卷积神经网络模型对语音数据进行分帧和采样的语音处理,能够快速、有效地获取语音数据中有用的局部数据,基于语音采样数据提取声纹特征并构建声纹特征向量进行目标用户的身份验证,能够提高身份验证的准确性及效率。
附图说明
图1为本申请电子装置一实施例的硬件架构的示意图;
图2为本申请基于声纹的身份验证方法一实施例的流程示意图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
需要说明的是,在本申请中涉及“第一”、“第二”等的描述仅用于描述目的,而不能理解为指示或暗示其相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。另外,各个实施例之间的技术方案可以相互结合,但是必 须是以本领域普通技术人员能够实现为基础,当技术方案的结合出现相互矛盾或无法实现时应当认为这种技术方案的结合不存在,也不在本申请要求的保护范围之内。
参阅图1所示,图1为本申请电子装置一实施例的硬件架构的示意图。电子装置1是一种能够按照事先设定或者存储的指令,自动进行数值计算和/或信息处理的设备。所述电子装置1可以是计算机、也可以是单个网络服务器、多个网络服务器组成的服务器组或者基于云计算的由大量主机或者网络服务器构成的云,其中云计算是分布式计算的一种,由一群松散耦合的计算机集组成的一个超级虚拟计算机。
在本实施例中,电子装置1可包括,但不仅限于,可通过系统总线相互通信连接的存储器11、处理器12、网络接口13,存储器11存储有可在处理器12上运行的处理系统。需要指出的是,图1仅示出了具有组件11-13的电子装置1,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。
其中,存储器11包括内存及至少一种类型的可读存储介质。内存为电子装置1的运行提供缓存;可读存储介质可为如闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等的非易失性存储介质。在一些实施例中,可读存储介质可以是电子装置1的内部存储单元,例如该电子装置1的硬盘;在另一些实施例中,该非易失性存储介质也可以是电子装置1的外部存储设备,例如电子装置1上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。本实施例中,存储器11的可读存储介质通常用于存储安装于电子装置1的操作系统和各类应用软件,例如本申请一实施例中的处理系统的程序代码等。此外,存储器11还可以用于暂时地存储已经输出或者将要输出的各类数据。
所述处理器12在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理 器12通常用于控制所述电子装置1的总体操作,例如执行与所述其他设备进行数据交互或者通信相关的控制和处理等。本实施例中,所述处理器12用于运行所述存储器11中存储的程序代码或者处理数据,例如运行处理系统等。
所述网络接口13可包括无线网络接口或有线网络接口,该网络接口13通常用于在所述电子装置1与其他电子设备之间建立通信连接。本实施例中,网络接口13主要用于将电子装置1与其他设备相连,建立数据传输通道和通信连接,以接收待进行身份验证的目标用户的语音数据。
所述处理系统存储在存储器11中,包括至少一个存储在存储器11中的计算机可读指令,该至少一个计算机可读指令可被处理器器12执行,以实现本申请各实施例的方法;以及,该至少一个计算机可读指令依据其各部分所实现的功能不同,可被划为不同的逻辑模块。
在一实施例中,上述处理系统被所述处理器12执行时实现如下步骤:
分帧采样步骤,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;
本实施例中,语音数据由语音采集设备采集得到(语音采集设备例如为麦克风)。在采集语音数据时,应尽量防止环境噪声和语音采集设备的干扰。语音采集设备与目标用户保持适当距离,且尽量不用失真大的语音采集设备,电源优选使用市电,并保持电流稳定;在进行电话录音时应使用传感器。在分帧和采样之前,可以对语音数据进行去噪音处理,以进一步减少干扰。为了能够提取得到语音数据的声纹特征,所采集的语音数据为预设数据长度的语音数据,或者为大于预设数据长度的语音数据。
在一优选的实施例中,接收到的语音数据为一维语音数据,分帧采样步骤,具体包括:
对该语音数据进行分帧,将分帧后的语音数据以帧为行,以帧内数据为列,得到该语音数据对应的二维语音数据;采用预设规格的卷积核,并基于第一预设步长,对该二维语音数据进行卷积;对卷积后的语音数据按照第二预设步长进行最大池化maxpooling采样,得到所述语音采样数据。
其中,语音信号只在较短时间内呈现平稳性,分帧是将一段语音信号分成N段短时间的语音信号,并且为了避免丢失语音信号的连续性特征,相邻语音帧之间有一段重复区域,重复区域一般为帧长的1/2。在分帧后,每一帧都当成平稳信号来处理。
其中,预设规格的卷积核可以为5*5的卷积核,第一预设步长可以为1*1,第二预设步长可以为2*2。
提取步骤,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;
声纹特征包括多种类型,例如宽带声纹、窄带声纹、振幅声纹等,本实施例预设类型声纹特征优选为语音采样数据的梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC),预设滤波器为梅尔滤波器。在构建对应的声纹特征向量时,将语音采样数据的声纹特征组成特征数据矩阵,该特征数据矩阵即为语音采样数据的声纹特征向量。
构建步骤,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;
本实施例中,该背景信道模型优选为高斯混合模型,利用该高斯混合模型来计算声纹特征向量,得出对应的当前声纹鉴别向量(即i-vector)。
具体地,该计算过程包括:
1)、选择高斯模型:首先,利用通用背景信道模型中的参数来计算每帧数据在不同高斯模型的似然对数值,通过对似然对数值矩阵每列并行排序,选取前N个高斯模型,最终获得一每帧数据在混合高斯模型中数值的矩阵:
Loglike=E(X)*D(X) -1*X T-0.5*D(X) -1*(X. 2) T
其中,Loglike为似然对数值矩阵,E(X)为通用背景信道模型训练出来的均值矩阵,D(X)为协方差矩阵,X为数据矩阵,X. 2为矩阵每个值取平方。
其中,似然对数值计算公式:loglikes i=C i+E i*Cov i -1*X i-X i T*X i*Cov i -1,loglikes i为似然对数值矩阵的第i行向量,C i为第i个模型的常数项,E i为第i个模型的均值矩阵,Cov i为第i个模型的协方差矩阵,X i为第i帧数据。
2)、计算后验概率:将每帧数据X进行X*XT计算,得到一个对称矩 阵,可简化为下三角矩阵,并将元素按顺序排列为1行,变成一个N帧乘以该下三角矩阵个数纬度的一个向量进行计算,将所有帧的该向量组合成新的数据矩阵,同时将通用背景模型中计算概率的协方差矩阵,每个矩阵也简化为下三角矩阵,变成与新数据矩阵类似的矩阵,在通过通用背景信道模型中的均值矩阵和协方差矩阵算出每帧数据的在该选择的高斯模型下的似然对数值,然后进行Softmax回归,最后进行归一化操作,得到每帧在混合高斯模型后验概率分布,将每帧的概率分布向量组成概率矩阵。
3)、提取当前声纹鉴别向量:首先进行一阶,二阶系数的计算,一阶系数计算可以通过概率矩阵列求和得到:
Figure PCTCN2018076113-appb-000001
其中,Gamma i为一阶系数向量的第i个元素,loglikes ji为似然对数值矩阵的第j行,第i个元素。
二阶系数可以通过概率矩阵的转置乘以数据矩阵获得:
X=Loglike T*feats,其中,X为二阶系数矩阵,loglike为似然对数值矩阵,feats为特征数据矩阵。
在计算得到一阶,二阶系数以后,并行计算一次项和二次项,然后通过一次项和二次项计算当前声纹鉴别向量。
优选地,训练高斯混合模型的过程包括:
获取预设数量(例如十万个)的语音数据样本,对该语音数据样本进行处理得到预设类型声纹特征,并基于各语音数据样本对应的声纹特征构建对应的声纹特征向量;
将该声纹特征向量分为第一比例(例如0.75)的训练集和第二比例(例如0.25)的验证集,所述第一比例及第二比例的和小于等于1;
利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;
若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为前述的背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。
其中,在利用训练集中的声纹特征向量对高斯混合模型进行训练时,抽 取出来的D维声纹特征对应的似然概率可用K个高斯分量表示为:
Figure PCTCN2018076113-appb-000002
其中,P(x)为语音数据样本由高斯混合模型生成的概率(混合高斯模型),w k为每个高斯模型的权重,p(x|k)为样本由第k个高斯模型生成的概率,K为高斯模型数量。
整个高斯混合模型的参数可以表示为:{w iii},w i为第i个高斯模型的权重,μ i为第i个高斯模型的均值,∑ i为第i个高斯模型的协方差。训练该高斯混合模型可以用非监督的EM算法,目标函数采用最大似然估计,即通过选择参数使对数似然函数最大。训练完成后,得到高斯混合模型的权重向量、常数向量、N个协方差矩阵、均值乘以协方差的矩阵等,即为一个训练好的高斯混合模型。
本实施例预先训练的背景信道模型为通过对大量语音数据的挖掘与比对训练得到,这一模型可以在最大限度保留用户的声纹特征的同时,精确刻画用户说话时的背景声纹特征,并能够在识别时将这一特征去除,而提取用户声音的固有特征,能够较大地提高用户身份验证的准确率及效率。
验证步骤,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
本实施例中,向量与向量之间的距离有多种,包括余弦距离及欧氏距离等等,优选地,本实施例的空间距离为余弦距离,余弦距离为利用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小的度量。
其中,标准声纹鉴别向量为预先获得并存储的声纹鉴别向量,标准声纹鉴别向量在存储时携带其对应的用户的标识信息,其能够准确代表对应的用户的身份。在计算空间距离前,根据用户提供的标识信息获得存储的标准声纹鉴别向量。
其中,在计算得到的空间距离小于等于预设距离阈值时,验证通过,反之,则验证失败。
与现有技术相比,本实施例在基于声纹对目标用户进行身份验证时,采用卷积神经网络模型对语音数据进行分帧和采样的语音处理,能够快速、有 效地获取语音数据中有用的局部数据,基于语音采样数据提取声纹特征并构建声纹特征向量进行目标用户的身份验证,能够提高身份验证的准确性及效率;此外,本实施例充分利用了人声中与声道相关的声纹特征,这种声纹特征并不需要对文本加以限制,因而在进行识别与验证的过程中有较大的灵活性。
在一优选的实施例中,在上述图1的实施例的基础上,上述的提取步骤包括:
对所述语音采样数据进行预加重及加窗处理,对每一个加窗进行傅立叶变换得到对应的频谱,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;
在梅尔频谱上进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。
本实施例中,预加重处理实际是高通滤波处理,滤除低频数据,使得语音数据中的高频特性更加突显,具体地,高通滤波的传递函数为:H(Z)=1-αZ -1,其中,Z为语音数据,α为常量系数,优选地,α的取值为0.97;由于语音采样数据在分帧之后在一定程度上背离原始语音,因此,需要对语音采样数据进行加窗处理。
本实施例中,在梅尔频谱上进行倒谱分析例如为取对数、做逆变换,逆变换一般是通过DCT离散余弦变换来实现,取DCT后的第2个到第13个系数作为梅尔频率倒谱系数MFCC。梅尔频率倒谱系数MFCC即为这帧语音采样数据的声纹特征,将每帧的梅尔频率倒谱系数MFCC组成特征数据矩阵,该特征数据矩阵即为语音采样数据的声纹特征向量。
本实施例取语音采样数据梅尔频率倒谱系数MFCC组成对应的声纹特征向量,由于其比用于正常的对数倒频谱中的线性间隔的频带更能近似人类的听觉系统,因此能够提高身份验证的准确性。
在一优选的实施例中,在上述图1的实施例的基础上,所述验证步骤,具体包括:
计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的余弦距离:
Figure PCTCN2018076113-appb-000003
为所述标准声纹鉴别向量,
Figure PCTCN2018076113-appb-000004
为当前声纹鉴别向量;若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。
本实施例中,在存储目标用户的标准声纹鉴别向量时可以携带目标用户的标识信息,在验证用户的身份时,根据当前声纹鉴别向量的标识信息匹配得到对应的标准声纹鉴别向量,并计算当前声纹鉴别向量与匹配得到的标准声纹鉴别向量之间的余弦距离,以余弦距离来验证目标用户的身份,提高身份验证的准确性。
如图2所示,图2为本申请基于声纹的身份验证方法一实施例的流程示意图,该基于声纹的身份验证方法包括以下步骤:
步骤S1,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;
本实施例中,语音数据由语音采集设备采集得到(语音采集设备例如为麦克风)。在采集语音数据时,应尽量防止环境噪声和语音采集设备的干扰。语音采集设备与目标用户保持适当距离,且尽量不用失真大的语音采集设备,电源优选使用市电,并保持电流稳定;在进行电话录音时应使用传感器。在分帧和采样之前,可以对语音数据进行去噪音处理,以进一步减少干扰。为了能够提取得到语音数据的声纹特征,所采集的语音数据为预设数据长度的语音数据,或者为大于预设数据长度的语音数据。
在一优选的实施例中,接收到的语音数据为一维语音数据,分帧采样步骤,具体包括:
对该语音数据进行分帧,将分帧后的语音数据以帧为行,以帧内数据为列,得到该语音数据对应的二维语音数据;采用预设规格的卷积核,并基于第一预设步长,对该二维语音数据进行卷积;对卷积后的语音数据按照第二 预设步长进行最大池化maxpooling采样,得到所述语音采样数据。
其中,语音信号只在较短时间内呈现平稳性,分帧是将一段语音信号分成N段短时间的语音信号,并且为了避免丢失语音信号的连续性特征,相邻语音帧之间有一段重复区域,重复区域一般为帧长的1/2。在分帧后,每一帧都当成平稳信号来处理。
其中,预设规格的卷积核可以为5*5的卷积核,第一预设步长可以为1*1,第二预设步长可以为2*2。
步骤S2,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;
声纹特征包括多种类型,例如宽带声纹、窄带声纹、振幅声纹等,本实施例预设类型声纹特征优选为语音采样数据的梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC),预设滤波器为梅尔滤波器。在构建对应的声纹特征向量时,将语音采样数据的声纹特征组成特征数据矩阵,该特征数据矩阵即为语音采样数据的声纹特征向量。
步骤S3,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;
本实施例中,该背景信道模型优选为高斯混合模型,利用该高斯混合模型来计算声纹特征向量,得出对应的当前声纹鉴别向量(即i-vector)。
具体地,该计算过程包括:
1)、选择高斯模型:首先,利用通用背景信道模型中的参数来计算每帧数据在不同高斯模型的似然对数值,通过对似然对数值矩阵每列并行排序,选取前N个高斯模型,最终获得一每帧数据在混合高斯模型中数值的矩阵:
Loglike=E(X)*D(X) -1*X T-0.5*D(X) -1*(X. 2) T
其中,Loglike为似然对数值矩阵,E(X)为通用背景信道模型训练出来的均值矩阵,D(X)为协方差矩阵,X为数据矩阵,X. 2为矩阵每个值取平方。
其中,似然对数值计算公式:loglikes i=C i+E i*Cov i -1*X i-X i T*X i*Cov i -1,loglikes i为似然对数值矩阵的第i行向量,C i为第i个模型的常数项,E i为第 i个模型的均值矩阵,Cov i为第i个模型的协方差矩阵,X i为第i帧数据。
2)、计算后验概率:将每帧数据X进行X*XT计算,得到一个对称矩阵,可简化为下三角矩阵,并将元素按顺序排列为1行,变成一个N帧乘以该下三角矩阵个数纬度的一个向量进行计算,将所有帧的该向量组合成新的数据矩阵,同时将通用背景模型中计算概率的协方差矩阵,每个矩阵也简化为下三角矩阵,变成与新数据矩阵类似的矩阵,在通过通用背景信道模型中的均值矩阵和协方差矩阵算出每帧数据的在该选择的高斯模型下的似然对数值,然后进行Softmax回归,最后进行归一化操作,得到每帧在混合高斯模型后验概率分布,将每帧的概率分布向量组成概率矩阵。
3)、提取当前声纹鉴别向量:首先进行一阶,二阶系数的计算,一阶系数计算可以通过概率矩阵列求和得到:
Figure PCTCN2018076113-appb-000005
其中,Gamma i为一阶系数向量的第i个元素,loglikes ji为似然对数值矩阵的第j行,第i个元素。
二阶系数可以通过概率矩阵的转置乘以数据矩阵获得:
X=Loglike T*feats,其中,X为二阶系数矩阵,loglike为似然对数值矩阵,feats为特征数据矩阵。
在计算得到一阶,二阶系数以后,并行计算一次项和二次项,然后通过一次项和二次项计算当前声纹鉴别向量。
优选地,训练高斯混合模型的过程包括:
获取预设数量(例如十万个)的语音数据样本,对该语音数据样本进行处理得到预设类型声纹特征,并基于各语音数据样本对应的声纹特征构建对应的声纹特征向量;
将该声纹特征向量分为第一比例(例如0.75)的训练集和第二比例(例如0.25)的验证集,所述第一比例及第二比例的和小于等于1;
利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;
若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模 型作为前述的背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。
其中,在利用训练集中的声纹特征向量对高斯混合模型进行训练时,抽取出来的D维声纹特征对应的似然概率可用K个高斯分量表示为:
Figure PCTCN2018076113-appb-000006
其中,P(x)为语音数据样本由高斯混合模型生成的概率(混合高斯模型),w k为每个高斯模型的权重,p(x|k)为样本由第k个高斯模型生成的概率,K为高斯模型数量。
整个高斯混合模型的参数可以表示为:{w iii},w i为第i个高斯模型的权重,μ i为第i个高斯模型的均值,∑ i为第i个高斯模型的协方差。训练该高斯混合模型可以用非监督的EM算法,目标函数采用最大似然估计,即通过选择参数使对数似然函数最大。训练完成后,得到高斯混合模型的权重向量、常数向量、N个协方差矩阵、均值乘以协方差的矩阵等,即为一个训练好的高斯混合模型。
本实施例预先训练的背景信道模型为通过对大量语音数据的挖掘与比对训练得到,这一模型可以在最大限度保留用户的声纹特征的同时,精确刻画用户说话时的背景声纹特征,并能够在识别时将这一特征去除,而提取用户声音的固有特征,能够较大地提高用户身份验证的准确率及效率。
步骤S4,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
本实施例中,向量与向量之间的距离有多种,包括余弦距离及欧氏距离等等,优选地,本实施例的空间距离为余弦距离,余弦距离为利用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小的度量。
其中,标准声纹鉴别向量为预先获得并存储的声纹鉴别向量,标准声纹鉴别向量在存储时携带其对应的用户的标识信息,其能够准确代表对应的用户的身份。在计算空间距离前,根据用户提供的标识信息获得存储的标准声纹鉴别向量。
其中,在计算得到的空间距离小于等于预设距离阈值时,验证通过,反 之,则验证失败。
与现有技术相比,本实施例在基于声纹对目标用户进行身份验证时,采用卷积神经网络模型对语音数据进行分帧和采样的语音处理,能够快速、有效地获取语音数据中有用的局部数据,基于语音采样数据提取声纹特征并构建声纹特征向量进行目标用户的身份验证,能够提高身份验证的准确性及效率;此外,本实施例充分利用了人声中与声道相关的声纹特征,这种声纹特征并不需要对文本加以限制,因而在进行识别与验证的过程中有较大的灵活性。
在一优选的实施例中,在上述图2的实施例的基础上,上述的步骤S2包括:
对所述语音采样数据进行预加重及加窗处理,对每一个加窗进行傅立叶变换得到对应的频谱,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;
在梅尔频谱上进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。
本实施例中,预加重处理实际是高通滤波处理,滤除低频数据,使得语音数据中的高频特性更加突显,具体地,高通滤波的传递函数为:H(Z)=1-αZ -1,其中,Z为语音数据,α为常量系数,优选地,α的取值为0.97;由于语音采样数据在分帧之后在一定程度上背离原始语音,因此,需要对语音采样数据进行加窗处理。
本实施例中,在梅尔频谱上进行倒谱分析例如为取对数、做逆变换,逆变换一般是通过DCT离散余弦变换来实现,取DCT后的第2个到第13个系数作为梅尔频率倒谱系数MFCC。梅尔频率倒谱系数MFCC即为这帧语音采样数据的声纹特征,将每帧的梅尔频率倒谱系数MFCC组成特征数据矩阵,该特征数据矩阵即为语音采样数据的声纹特征向量。
本实施例取语音采样数据梅尔频率倒谱系数MFCC组成对应的声纹特征向量,由于其比用于正常的对数倒频谱中的线性间隔的频带更能近似人类的听觉系统,因此能够提高身份验证的准确性。
在一优选的实施例中,在上述图2的实施例的基础上,所述步骤S4, 具体包括:
计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的余弦距离:
Figure PCTCN2018076113-appb-000007
为所述标准声纹鉴别向量,
Figure PCTCN2018076113-appb-000008
为当前声纹鉴别向量;若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。
本实施例中,在存储目标用户的标准声纹鉴别向量时可以携带目标用户的标识信息,在验证用户的身份时,根据当前声纹鉴别向量的标识信息匹配得到对应的标准声纹鉴别向量,并计算当前声纹鉴别向量与匹配得到的标准声纹鉴别向量之间的余弦距离,以余弦距离来验证目标用户的身份,提高身份验证的准确性。
本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有处理系统,所述处理系统被处理器执行时实现上述的基于声纹的身份验证方法的步骤。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种电子装置,其特征在于,所述电子装置包括存储器及与所述存储器连接的处理器,所述存储器中存储有可在所述处理器上运行的处理系统,所述处理系统被所述处理器执行时实现如下步骤:
    分帧采样步骤,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;
    提取步骤,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;
    构建步骤,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;
    验证步骤,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
  2. 根据权利要求1所述的电子装置,其特征在于,所述分帧采样步骤,具体包括:
    对该语音数据进行分帧,将分帧后的语音数据以帧为行,以帧内数据为列,得到该语音数据对应的二维语音数据;
    采用预设规格的卷积核,并基于第一预设步长,对该二维语音数据进行卷积;
    对卷积后的语音数据按照第二预设步长进行最大池化maxpooling采样,得到所述语音采样数据。
  3. 根据权利要求1或2所述的电子装置,其特征在于,所述提取步骤,具体包括:
    对所述语音采样数据进行预加重及加窗处理,对每一个加窗进行傅立叶变换得到对应的频谱,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;
    在梅尔频谱上进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述 梅尔频率倒谱系数MFCC组成对应的声纹特征向量。
  4. 根据权利要求1或2所述的电子装置,其特征在于,所述验证步骤,具体包括:
    计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的余弦距离:
    Figure PCTCN2018076113-appb-100001
    Figure PCTCN2018076113-appb-100002
    为所述标准声纹鉴别向量,
    Figure PCTCN2018076113-appb-100003
    为当前声纹鉴别向量;
    若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;
    若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。
  5. 根据权利要求1或2所述的电子装置,其特征在于,所述背景信道模型为高斯混合模型,所述处理系统被所述处理器执行时实现如下步骤:
    获取预设数量的语音数据样本,对该语音数据样本进行处理得到预设类型声纹特征,并基于各语音数据样本对应的声纹特征构建对应的声纹特征向量;
    将该声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。
  6. 一种基于声纹的身份验证方法,其特征在于,所述基于声纹的身份验证方法包括:
    S1,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;
    S2,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;
    S3,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;
    S4,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
  7. 根据权利要求6所述的基于声纹的身份验证方法,其特征在于,所述步骤S1包括:
    对该语音数据进行分帧,将分帧后的语音数据以帧为行,以帧内数据为列,得到该语音数据对应的二维语音数据;
    采用预设规格的卷积核,并基于第一预设步长,对该二维语音数据进行卷积;
    对卷积后的语音数据按照第二预设步长进行最大池化maxpooling采样,得到所述语音采样数据。
  8. 根据权利要求6或7所述的基于声纹的身份验证方法,其特征在于,所述步骤S2包括:
    对所述语音采样数据进行预加重及加窗处理,对每一个加窗进行傅立叶变换得到对应的频谱,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;
    在梅尔频谱上进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。
  9. 根据权利要求6或7所述的基于声纹的身份验证方法,其特征在于,所述步骤S4包括:
    计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的余弦距离:
    Figure PCTCN2018076113-appb-100004
    Figure PCTCN2018076113-appb-100005
    为所述标准声纹鉴别向量,
    Figure PCTCN2018076113-appb-100006
    为当前声纹鉴别向量;
    若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;
    若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。
  10. 根据权利要求6或7所述的基于声纹的身份验证方法,其特征在于, 所述背景信道模型为高斯混合模型,所述步骤S3之前包括:
    获取预设数量的语音数据样本,对该语音数据样本进行处理得到预设类型声纹特征,并基于各语音数据样本对应的声纹特征构建对应的声纹特征向量;
    将该声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。
  11. 一种基于声纹的身份验证系统,其特征在于,所述基于声纹的身份验证系统包括:
    分帧采样模块,用于在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;
    提取模块,用于利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;
    构建模块,用于将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;
    验证模块,用于计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
  12. 根据权利要求11所述的基于声纹的身份验证系统,其特征在于,所述分帧采样模块,具体用于对该语音数据进行分帧,将分帧后的语音数据以帧为行,以帧内数据为列,得到该语音数据对应的二维语音数据;采用预设规格的卷积核,并基于第一预设步长,对该二维语音数据进行卷积;对卷积 后的语音数据按照第二预设步长进行最大池化maxpooling采样,得到所述语音采样数据。
  13. 根据权利要求11或12所述的基于声纹的身份验证系统,其特征在于,所述提取模块,具体用于对所述语音采样数据进行预加重及加窗处理,对每一个加窗进行傅立叶变换得到对应的频谱,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;在梅尔频谱上进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。
  14. 根据权利要求11或12所述的基于声纹的身份验证系统,其特征在于,所述验证模块,具体用于:
    计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的余弦距离:
    Figure PCTCN2018076113-appb-100007
    Figure PCTCN2018076113-appb-100008
    为所述标准声纹鉴别向量,
    Figure PCTCN2018076113-appb-100009
    为当前声纹鉴别向量;若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。
  15. 根据权利要求11或12所述的基于声纹的身份验证系统,其特征在于,所述背景信道模型为高斯混合模型,还包括:
    获取模块,用于获取预设数量的语音数据样本,对该语音数据样本进行处理得到预设类型声纹特征,并基于各语音数据样本对应的声纹特征构建对应的声纹特征向量;
    划分模块,用于将该声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;
    训练模块,用于利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;
    处理模块,用于若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。
  16. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质上存储有处理系统,所述处理系统被处理器执行时实现步骤:
    分帧采样步骤,在接收到待进行身份验证的目标用户的语音数据后,调用预定的卷积神经网络CNN模型对该语音数据进行分帧和采样,得到语音采样数据;
    提取步骤,利用预设滤波器对该语音采样数据进行处理以提取预设类型声纹特征,并基于该预设类型声纹特征构建所述语音数据对应的声纹特征向量;
    构建步骤,将该声纹特征向量输入预先训练的背景信道模型,以构建出所述语音数据的当前声纹鉴别向量;
    验证步骤,计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的空间距离,基于该空间距离对该用户进行身份验证,并生成验证结果。
  17. 根据权利要求16所述的计算机可读存储介质,其特征在于,所述分帧采样步骤,具体包括:
    对该语音数据进行分帧,将分帧后的语音数据以帧为行,以帧内数据为列,得到该语音数据对应的二维语音数据;
    采用预设规格的卷积核,并基于第一预设步长,对该二维语音数据进行卷积;
    对卷积后的语音数据按照第二预设步长进行最大池化maxpooling采样,得到所述语音采样数据。
  18. 根据权利要求16或17所述的计算机可读存储介质,其特征在于,所述提取步骤,具体包括:
    对所述语音采样数据进行预加重及加窗处理,对每一个加窗进行傅立叶变换得到对应的频谱,将所述频谱输入梅尔滤波器以输出得到梅尔频谱;
    在梅尔频谱上进行倒谱分析以获得梅尔频率倒谱系数MFCC,基于所述梅尔频率倒谱系数MFCC组成对应的声纹特征向量。
  19. 根据权利要求16或17所述的计算机可读存储介质,其特征在于,所 述验证步骤,具体包括:
    计算该当前声纹鉴别向量与预存的该目标用户的标准声纹鉴别向量之间的余弦距离:
    Figure PCTCN2018076113-appb-100010
    Figure PCTCN2018076113-appb-100011
    为所述标准声纹鉴别向量,
    Figure PCTCN2018076113-appb-100012
    为当前声纹鉴别向量;
    若所述余弦距离小于或者等于预设的距离阈值,则生成验证通过的信息;
    若所述余弦距离大于预设的距离阈值,则生成验证不通过的信息。
  20. 根据权利要求16或17所述的计算机可读存储介质,其特征在于,所述背景信道模型为高斯混合模型,所述处理系统被所述处理器执行时实现如下步骤:
    获取预设数量的语音数据样本,对该语音数据样本进行处理得到预设类型声纹特征,并基于各语音数据样本对应的声纹特征构建对应的声纹特征向量;
    将该声纹特征向量分为第一比例的训练集和第二比例的验证集,所述第一比例及第二比例的和小于等于1;
    利用所述训练集中的声纹特征向量对高斯混合模型进行训练,并在训练完成后,利用所述验证集对训练后的高斯混合模型的准确率进行验证;
    若所述准确率大于预设阈值,则模型训练结束,以训练后的高斯混合模型作为所述背景信道模型,或者,若所述准确率小于等于预设阈值,则增加所述语音数据样本的数量,并基于增加后的语音数据样本重新进行训练。
PCT/CN2018/076113 2017-11-21 2018-02-10 电子装置、基于声纹的身份验证方法、系统及存储介质 WO2019100606A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711161344.0 2017-11-21
CN201711161344.0A CN107993071A (zh) 2017-11-21 2017-11-21 电子装置、基于声纹的身份验证方法及存储介质

Publications (1)

Publication Number Publication Date
WO2019100606A1 true WO2019100606A1 (zh) 2019-05-31

Family

ID=62031709

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076113 WO2019100606A1 (zh) 2017-11-21 2018-02-10 电子装置、基于声纹的身份验证方法、系统及存储介质

Country Status (2)

Country Link
CN (1) CN107993071A (zh)
WO (1) WO2019100606A1 (zh)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108806696B (zh) * 2018-05-08 2020-06-05 平安科技(深圳)有限公司 建立声纹模型的方法、装置、计算机设备和存储介质
CN108648759A (zh) * 2018-05-14 2018-10-12 华南理工大学 一种文本无关的声纹识别方法
CN108650266B (zh) * 2018-05-14 2020-02-18 平安科技(深圳)有限公司 服务器、声纹验证的方法及存储介质
CN108922543B (zh) * 2018-06-11 2022-08-16 平安科技(深圳)有限公司 模型库建立方法、语音识别方法、装置、设备及介质
CN109257362A (zh) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 声纹验证的方法、装置、计算机设备以及存储介质
CN110634492B (zh) * 2019-06-13 2023-08-25 中信银行股份有限公司 登录验证方法、装置、电子设备及计算机可读存储介质
CN110265037B (zh) * 2019-06-13 2022-09-30 中信银行股份有限公司 身份验证方法、装置、电子设备及计算机可读存储介质
CN110556126B (zh) * 2019-09-16 2024-01-05 平安科技(深圳)有限公司 语音识别方法、装置以及计算机设备
CN110782879B (zh) * 2019-09-18 2023-07-07 平安科技(深圳)有限公司 基于样本量的声纹聚类方法、装置、设备及存储介质
CN113177816A (zh) * 2020-01-08 2021-07-27 阿里巴巴集团控股有限公司 一种信息处理方法及装置
CN111477235B (zh) * 2020-04-15 2023-05-05 厦门快商通科技股份有限公司 一种声纹采集方法和装置以及设备
CN111524525B (zh) * 2020-04-28 2023-06-16 平安科技(深圳)有限公司 原始语音的声纹识别方法、装置、设备及存储介质
CN111862933A (zh) * 2020-07-20 2020-10-30 北京字节跳动网络技术有限公司 用于生成合成语音的方法、装置、设备和介质
CN112331217B (zh) * 2020-11-02 2023-09-12 泰康保险集团股份有限公司 声纹识别方法和装置、存储介质、电子设备
CN112669820B (zh) * 2020-12-16 2023-08-04 平安科技(深圳)有限公司 基于语音识别的考试作弊识别方法、装置及计算机设备
CN114780787A (zh) * 2022-04-01 2022-07-22 杭州半云科技有限公司 声纹检索方法、身份验证方法、身份注册方法和装置
CN115086045B (zh) * 2022-06-17 2023-05-19 海南大学 基于声纹伪造检测的数据安全防护方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106205606A (zh) * 2016-08-15 2016-12-07 南京邮电大学 一种基于语音识别的动态定位监控方法及系统
CN106847309A (zh) * 2017-01-09 2017-06-13 华南理工大学 一种语音情感识别方法
CN107068154A (zh) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 基于声纹识别的身份验证的方法及系统

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197130B (zh) * 2006-12-07 2011-05-18 华为技术有限公司 声音活动检测方法和声音活动检测器
CN101923855A (zh) * 2009-06-17 2010-12-22 复旦大学 文本无关的声纹识别系统
CN101894566A (zh) * 2010-07-23 2010-11-24 北京理工大学 一种基于共振峰频率的汉语普通话复韵母可视化方法
CN103310273A (zh) * 2013-06-26 2013-09-18 南京邮电大学 基于diva模型的带声调的汉语元音发音方法
CN106682574A (zh) * 2016-11-18 2017-05-17 哈尔滨工程大学 一维深度卷积网络的水下多目标识别方法
CN106847302B (zh) * 2017-02-17 2020-04-14 大连理工大学 基于卷积神经网络的单通道混合语音时域分离方法
CN107240397A (zh) * 2017-08-14 2017-10-10 广东工业大学 一种基于声纹识别的智能锁及其语音识别方法和系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106205606A (zh) * 2016-08-15 2016-12-07 南京邮电大学 一种基于语音识别的动态定位监控方法及系统
CN106847309A (zh) * 2017-01-09 2017-06-13 华南理工大学 一种语音情感识别方法
CN107068154A (zh) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 基于声纹识别的身份验证的方法及系统

Also Published As

Publication number Publication date
CN107993071A (zh) 2018-05-04

Similar Documents

Publication Publication Date Title
WO2019100606A1 (zh) 电子装置、基于声纹的身份验证方法、系统及存储介质
WO2018166187A1 (zh) 服务器、身份验证方法、系统及计算机可读存储介质
WO2019136912A1 (zh) 电子装置、身份验证的方法、系统及存储介质
CN105702263B (zh) 语音重放检测方法和装置
JP6621536B2 (ja) 電子装置、身元認証方法、システム及びコンピュータ読み取り可能な記憶媒体
JP6429945B2 (ja) 音声データを処理するための方法及び装置
US9373330B2 (en) Fast speaker recognition scoring using I-vector posteriors and probabilistic linear discriminant analysis
CN110956966B (zh) 声纹认证方法、装置、介质及电子设备
WO2020181824A1 (zh) 声纹识别方法、装置、设备以及计算机可读存储介质
CN110851835A (zh) 图像模型检测方法、装置、电子设备及存储介质
WO2019200744A1 (zh) 自更新的反欺诈方法、装置、计算机设备和存储介质
WO2021051572A1 (zh) 语音识别方法、装置以及计算机设备
WO2021042537A1 (zh) 语音识别认证方法及系统
WO2019196305A1 (zh) 电子装置、身份验证的方法及存储介质
CN112053695A (zh) 声纹识别方法、装置、电子设备及存储介质
WO2019218512A1 (zh) 服务器、声纹验证的方法及存储介质
CN105224844B (zh) 验证方法、系统和装置
CN110797033A (zh) 基于人工智能的声音识别方法、及其相关设备
US11783841B2 (en) Method for speaker authentication and identification
WO2019218515A1 (zh) 服务器、基于声纹的身份验证方法及存储介质
US20180261227A1 (en) Methods and systems for determining user liveness
CN116343798A (zh) 一种远场场景下说话人身份的验证方法和装置、电子设备
WO2021196458A1 (zh) 贷款智能进件方法、装置及存储介质
CN113436633B (zh) 说话人识别方法、装置、计算机设备及存储介质
CN113593579B (zh) 一种声纹识别方法、装置和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18880600

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18880600

Country of ref document: EP

Kind code of ref document: A1