WO2020073519A1 - Voiceprint verification method and apparatus, computer device and storage medium - Google Patents

Voiceprint verification method and apparatus, computer device and storage medium

Info

Publication number
WO2020073519A1
WO2020073519A1 · PCT/CN2018/124402
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
feature
voiceprint feature
vector
distance value
Prior art date
Application number
PCT/CN2018/124402
Other languages
French (fr)
Chinese (zh)
Inventor
杨翘楚
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020073519A1 publication Critical patent/WO2020073519A1/en

Classifications

    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L 63/00: Network architectures or network communication protocols for network security
            • H04L 63/08: ... for authentication of entities
              • H04L 63/0861: ... using biometrical features, e.g. fingerprint, retina-scan
    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 17/00: Speaker identification or verification techniques
            • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
            • G10L 17/16: Hidden Markov models [HMM]
          • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L 25/03: ... characterised by the type of extracted parameters
              • G10L 25/18: ... the extracted parameters being spectral information of each sub-band
              • G10L 25/24: ... the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of voiceprint verification, in particular to a method, device, computer equipment and storage medium for voiceprint verification.
  • the main purpose of this application is to provide a voiceprint verification method that solves the technical problem that, in the existing voiceprint verification process, the voice data collected by the client must be sent to the background for voiceprint feature extraction, resulting in poor confidentiality of the voice data in transmission.
  • This application proposes a method for voiceprint verification, including:
  • the voiceprint verification server receives the first voiceprint feature sent by the client server
  • the voiceprint verification server judges whether the feature distance value between the voiceprint discrimination vectors (i-vectors) respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets a preset requirement;
  • if it does, the first voiceprint feature is determined to be the same as the pre-stored voiceprint feature; otherwise, they are determined to be different.
  • This application also provides a voiceprint verification system, including a client, a client server, and a voiceprint verification server;
  • the client collects the voice signal of the identity to be verified, and sends the voice signal to the client server;
  • the client server receives the voice signal, extracts voiceprint features from the voice signal to obtain a first voiceprint feature, and transmits the first voiceprint feature to the voiceprint verification server;
  • the voiceprint verification server receives the first voiceprint feature and compares it with a pre-stored voiceprint feature to judge whether the two are the same, feeding the judgment result back to the client server;
  • the client server controls the client to perform a feedback response according to the judgment result.
  • the present application also provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the steps of the foregoing method are implemented.
  • the present application also provides a computer non-volatile readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the above method.
  • in this application, the function of extracting the voiceprint feature vector is moved forward onto the client server: after the client collects the voice signal by recording, the local client server directly extracts the voiceprint feature vector of the voice signal, and only then is the vector transmitted to the third-party verification server for voiceprint verification, voiceprint verification model training, and speaker recognition.
  • because the voiceprint feature vector cannot be reversed to restore the original voice signal, the customer's recorded voice signal stays confidential, which improves data security and the security of the customer identity authentication process.
  • only the data obtained after extracting the voiceprint feature vector is transmitted to the server for voiceprint verification; since the voiceprint feature vector is far smaller than the original voice signal data, transmission efficiency is greatly increased.
  • based on a GMM-UBM, this application maps each voiceprint feature vector to a low-dimensional voiceprint discrimination vector (i-vector), reducing the computation cost and the usage cost of voiceprint verification.
  • during verification, comparison against the pre-stored data of multiple people lowers the equal error rate of voiceprint verification and reduces the influence of its model errors.
  • FIG. 1 is a schematic flowchart of a method for voiceprint verification according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
  • a method of voiceprint verification collects information through a client and performs voiceprint verification through a server.
  • the method includes:
  • the MFCC (Mel Frequency Cepstral Coefficient) voiceprint feature used in this embodiment is non-linear, so that the analysis of the customer's voice signal in each frequency band is closer to the characteristics of real human speech, improving the effect of voiceprint verification.
  • the client server is used to construct the MFCC-type voiceprint feature into voiceprint feature vectors corresponding to each frame of voice data to form a first voiceprint feature.
  • the voiceprint feature vector corresponding to each frame of voice data is constructed from the extracted MFCC voiceprint features, and the per-frame vectors are then combined in frame order to obtain the first voiceprint feature corresponding to the client's voice signal; this step is still completed on the client server to enhance data confidentiality during transmission.
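The per-frame assembly described above can be sketched as follows. This is an illustrative sketch only: the frame count, MFCC dimensionality, and the random stand-in for real MFCC extraction are assumptions, not values from the patent.

```python
import random

# Illustrative sketch: each frame of speech yields one MFCC vector, and
# combining the per-frame vectors in frame order forms the "first
# voiceprint feature". Dimensions below are typical, assumed values.

N_FRAMES = 200   # e.g. about 2 s of speech at a 10 ms frame shift
N_MFCC = 13      # a common MFCC dimensionality

random.seed(0)
# stand-in for real per-frame MFCC extraction from the voice signal
frame_mfccs = {i: [random.gauss(0, 1) for _ in range(N_MFCC)]
               for i in range(N_FRAMES)}

# combine the per-frame vectors by sorting on frame index, as the text describes
first_voiceprint_feature = [frame_mfccs[i] for i in sorted(frame_mfccs)]
print(len(first_voiceprint_feature), len(first_voiceprint_feature[0]))
```

The resulting matrix (frames by coefficients) is what would be transmitted to the voiceprint verification server in place of the raw audio.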
  • the voiceprint verification server receives the first voiceprint feature sent by the client server.
  • the extraction of the first voiceprint feature is moved forward onto the client server, so that after receiving the customer's recorded voice signal, the client server directly extracts the first voiceprint feature corresponding to the voice signal and then transfers it to the third-party voiceprint verification server for voiceprint verification. Because the first voiceprint feature cannot be reversed to restore the original voice signal, the customer's recorded voice signal stays confidential, which improves data security and the security of the customer identity authentication process. At the same time, the first voiceprint feature has a smaller data volume than the voice signal, greatly increasing transmission efficiency.
  • the voiceprint features are extracted from the collected voice signal by the client server and transmitted to the voiceprint verification server for voiceprint verification, so that voiceprint feature extraction (on the client server) and voiceprint verification (on the voiceprint verification server) are separated.
  • the voiceprint verification server determines whether the feature distance value between the voiceprint identification vector i-vector corresponding to the first voiceprint feature and the pre-stored voiceprint feature respectively meets the preset requirements.
  • the preset requirement in this embodiment includes that the feature distance value falls within a specified preset threshold range, and it can be customized for specific application scenarios to meet a wider range of personalized usage requirements.
  • if the requirement is met, the voiceprint verification server feeds the verification-passed result back to the client through the client server; otherwise, it feeds back the verification-failure result, so that the client can perform further application operations based on the feedback. For example, after verification passes, a smart door is controlled to open; as another example, after a specified number of verification failures, the security system locks the screen to prevent criminals from further attacking the electronic banking system.
  • step S4 of this embodiment includes:
  • This embodiment is based on a GMM-UBM (Gaussian Mixture Model – Universal Background Model) to map the voiceprint feature vectors corresponding to each frame of speech data into low-dimensional voiceprint discrimination vectors (i-vectors).
  • the training process of the GMM-UBM in this embodiment is as follows:
  • B1. Obtain a preset number of voice data samples (for example, 100,000); each voice data sample corresponds to one voiceprint discrimination vector, and the samples may be collected from different people speaking in different environments, so that they can train a universal background model (GMM-UBM) characterizing general speech.
  • B2. Pre-process each voice data sample separately to extract its voiceprint features of the preset type, and construct the voiceprint feature vector corresponding to each sample from those features.
  • B3. Divide all the constructed voiceprint feature vectors into a training set of a first percentage and a verification set of a second percentage, where the first percentage and the second percentage are each less than or equal to 100%.
  • B4. Train the model using the voiceprint feature vectors in the training set, and after training is completed, verify the accuracy of the trained model using the verification set.
  • B5. If the accuracy reaches the preset standard rate (e.g., 98.5%), model training ends; otherwise, increase the number of voice data samples and re-execute steps B2 to B5 on the enlarged sample set.
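The B1–B5 loop can be sketched as a simple train/validate/grow cycle. Everything here is a stand-in: `train_model`, `evaluate`, and `collect_more_samples` are hypothetical placeholders, not the patent's actual GMM-UBM training code; only the control flow mirrors the steps above.

```python
# Hypothetical sketch of steps B1-B5: train on a split, validate, and grow
# the sample pool until accuracy reaches the preset standard rate (98.5%).

def train_model(train_set):
    # placeholder for GMM-UBM training on voiceprint feature vectors
    return {"n_train": len(train_set)}

def evaluate(model, val_set):
    # placeholder: pretend accuracy improves as the training pool grows
    return min(1.0, 0.90 + model["n_train"] / 1e6)

def collect_more_samples(n):
    # placeholder for gathering additional voice data samples (B1/B5)
    return [None] * n

def train_until_standard(samples, standard_rate=0.985, train_frac=0.8):
    while True:
        split = int(len(samples) * train_frac)           # B3: train/verify split
        model = train_model(samples[:split])             # B4: train the model
        acc = evaluate(model, samples[split:])           # B4: check accuracy
        if acc >= standard_rate:                         # B5: standard reached
            return model, acc
        samples = samples + collect_more_samples(10000)  # B5: grow, redo B2-B4

model, acc = train_until_standard([None] * 100000)       # B1: 100,000 samples
print(acc >= 0.985)
```

With the toy accuracy curve above, the loop adds one batch of samples before meeting the 98.5% standard and stops.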
  • the voiceprint discrimination vector of this embodiment is expressed as the i-vector. Compared with the dimension of the Gaussian space, the i-vector has a much lower dimension, which reduces computing costs.
  • the preset conditions in this embodiment include that the cosine distance value is within a specified threshold value range, etc., which can be set as needed.
  • it is determined whether the first few sorted first cosine distance values (the preset sorting positions) include the first cosine distance value corresponding to the target person's pre-stored voiceprint feature; if so, the cosine distance value is determined to satisfy the preset condition.
  • step S41 of this embodiment includes:
  • S410 Input voiceprint feature vectors corresponding to each frame of extracted speech data to the GMM-UBM model, respectively, to obtain a Gaussian supervector representing the probability distribution of each frame of speech data on each Gaussian component.
  • S411 Use the above Gaussian supervectors to calculate the low-dimensional voiceprint discrimination vector i-vector corresponding to each frame of speech data via the formula M = μ + Tω, where M is the Gaussian supervector of the frame of voice data, μ is the mean supervector of the GMM-UBM model, ω is the low-dimensional voiceprint discrimination vector i-vector of the frame, and T is the transformation matrix that maps ω into the high-dimensional Gaussian space.
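The supervector relation can be illustrated numerically. This is a sketch under stated assumptions: real i-vector extraction uses a posterior (MAP) estimate of ω, whereas here a plain least-squares inverse stands in to show the dimensionality reduction; the dimensions and random matrices are made up.

```python
import numpy as np

# Sketch of the relation M = mu + T @ w: M is a frame's Gaussian
# supervector, mu the UBM mean supervector, T the transformation
# (total-variability) matrix, w the low-dimensional i-vector.

rng = np.random.default_rng(1)
D_SUPER, D_IVEC = 1024, 40            # assumed dimensions

T = rng.standard_normal((D_SUPER, D_IVEC))
mu = rng.standard_normal(D_SUPER)

w_true = rng.standard_normal(D_IVEC)  # a hidden low-dimensional i-vector
M = mu + T @ w_true                   # its high-dimensional Gaussian supervector

# recover the i-vector: w = argmin ||M - mu - T w||  (least squares)
w_est, *_ = np.linalg.lstsq(T, M - mu, rcond=None)
print(np.allclose(w_est, w_true))
```

Since M was generated exactly from the model, least squares recovers ω; the point is that a 1024-dimensional supervector is represented by a 40-dimensional vector, which is what lowers the computation cost mentioned above.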
  • the EM algorithm (Expectation-Maximization algorithm) is an iterative algorithm used in statistics to find maximum-likelihood estimates of parameters in probability models that depend on unobservable latent variables.
  • it alternates between two steps: 1) the expectation step (E) computes the expectation of the latent variables using the current estimates of the model parameters; 2) the maximization step (M) re-estimates the parameters by maximum likelihood using the latent-variable expectations obtained in the E-step.
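The alternation just described can be shown on a toy problem. This is not the patent's GMM-UBM training; it is a minimal EM loop that re-estimates only the means of a two-component 1-D Gaussian mixture (unit variances and equal weights are fixed, simplifying assumptions for illustration).

```python
import math
import random

# Toy EM: fit the means of a 1-D two-component Gaussian mixture.
random.seed(0)
data = [random.gauss(-4, 1) for _ in range(300)] + \
       [random.gauss(4, 1) for _ in range(300)]

mu = [-1.0, 1.0]   # crude initial means

def pdf(x, m):
    # unit-variance Gaussian density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: expected component responsibilities under current means
    resp = []
    for x in data:
        p = [pdf(x, m) for m in mu]
        s = sum(p)
        resp.append([pi / s for pi in p])
    # M-step: maximize likelihood -> responsibility-weighted mean per component
    for k in range(2):
        w = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / w

print(sorted(round(m) for m in mu))
```

The estimated means converge to the true component means near -4 and 4, illustrating how alternating E and M steps climbs the likelihood without ever observing which component generated each point.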
  • step S43 of this embodiment includes:
  • S430 Obtain the first cosine distance value between the first voiceprint feature and the pre-stored voiceprint feature of each person in the pre-stored voiceprint feature data, where the voiceprint feature data of multiple people includes the pre-stored voiceprint feature of the target person.
  • the pre-stored voiceprint feature data of multiple persons including the target person is used to determine whether the voiceprint feature of the currently collected voice signal is the same as the target person's voiceprint feature, so as to improve the judgment accuracy.
  • This embodiment uses the cosine distance formula d(x, y) = 1 − (x·y)/(‖x‖‖y‖) to compute the first cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, where x is each pre-stored voiceprint discrimination vector and y is the voiceprint discrimination vector i-vector of the first voiceprint feature; the smaller the cosine distance value, the closer (or more similar) the two voiceprint features are.
  • the "first" in this embodiment is only used for distinction, not for limitation, and the functions in other places are the same, and will not be repeated.
  • S431 Sort the first cosine distance values in ascending order.
  • the first cosine distance values between each pre-stored voiceprint feature and the first voiceprint feature are sorted from small to large, so that the similarity distribution between the first voiceprint feature and each pre-stored voiceprint feature can be analyzed, and the verification result for the first voiceprint feature obtained, more accurately.
  • S432 Determine whether the first preset number of sorted first cosine distance values includes the first cosine distance value corresponding to the pre-stored voiceprint feature of the target person.
  • if the first preset number of first cosine distance values includes the one corresponding to the target person's pre-stored voiceprint feature, the first voiceprint feature is determined to be the same as the target person's pre-stored voiceprint feature; this reduces the recognition error rate caused by model errors.
  • here, the error rate refers to "the frequency with which verification fails when it should pass, and the frequency with which verification passes when it should fail."
  • the preset number of first cosine distance values in this embodiment includes 1, 2, or 3, etc., which can be set according to usage requirements.
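Steps S430–S432 can be sketched end to end. The vectors, names, and the 1 − cosine-similarity distance convention below are illustrative assumptions consistent with the text (smaller means more similar), not data from the patent.

```python
import math

# Sketch of S430-S432: compute cosine distances from the first voiceprint
# feature to every pre-stored i-vector, sort ascending, and pass only if
# the target person's entry lands in the first preset number of results.

def cosine_distance(x, y):
    # 1 - cosine similarity: smaller means more similar
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

prestored = {                      # multiple people's pre-stored i-vectors
    "target":  [1.0, 0.9, 0.1],
    "other_a": [-1.0, 0.2, 0.5],
    "other_b": [0.0, -1.0, 0.3],
}
first_feature = [0.9, 1.0, 0.0]   # i-vector of the collected voice signal

ranked = sorted(prestored,
                key=lambda k: cosine_distance(first_feature, prestored[k]))
TOP_N = 1                          # the preset number (1, 2, or 3 per the text)
verified = "target" in ranked[:TOP_N]
print(verified)
```

Because the made-up first feature is nearly collinear with the target's stored vector, the target ranks first and verification passes.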
  • step S43 of another embodiment of the present application includes:
  • S434 Obtain a second cosine distance value between the pre-stored voiceprint feature of the target person and the first voiceprint feature.
  • S435 Determine whether the second cosine distance value is less than or equal to a preset threshold.
  • the preset threshold is 0.6.
  • if the cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the preset threshold, the cosine distance value is determined to satisfy the preset condition, the first voiceprint feature is determined to be the same as the target user's pre-stored voiceprint feature, and the verification passes; if the cosine distance is greater than the preset threshold, the distance value is determined not to satisfy the preset condition, the two features are determined to be different, and the verification fails.
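The single-target variant (S434–S435) reduces to one threshold comparison. The 0.6 threshold is from the text; the sample distance values are made up for illustration.

```python
# Sketch of S434-S435: single-target verification against a preset threshold.

def verify(second_cosine_distance, threshold=0.6):
    # pass iff the distance to the target's pre-stored feature is small enough
    return second_cosine_distance <= threshold

print(verify(0.31), verify(0.74))
```

This trades the robustness of the multi-person top-N comparison for a cheaper check against a single enrolled speaker.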
  • This application also provides a voiceprint verification system, including a client, a client server, and a voiceprint verification server;
  • the client collects the voice signal of the identity to be verified and sends the voice signal to the client server;
  • the client server receives the voice signal, extracts voiceprint features from the voice signal to obtain a first voiceprint feature, and transmits the first voiceprint feature to the voiceprint verification server;
  • the voiceprint verification server receives the first voiceprint feature and compares it with a pre-stored voiceprint feature to judge whether the two are the same, feeding the judgment result back to the client server;
  • the client server controls the client to perform a feedback response according to the judgment result.
  • in this embodiment, the client samples the continuous analog voice signal at a specified sampling period to form a discrete signal, which is quantized into a digital signal according to specified encoding rules; the client server then receives this voice signal.
  • the process of extracting voiceprint features from the voice signal to obtain the first voiceprint feature is as follows:
  • pre-emphasis: due to the physiological characteristics of the human vocal tract, the high-frequency components of the voice signal are often suppressed, and pre-emphasis compensates for these high-frequency components;
  • framing: because of the short-time stationarity of the voice signal, spectrum analysis and feature extraction are performed in units of frames, the voice signal being divided into frames of usually 10 to 30 milliseconds each;
  • windowing: after framing, a window is applied to attenuate the signal at the beginning and end of each frame; this embodiment uses a Hamming window.
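The three front-end steps just listed can be sketched as follows. The pre-emphasis coefficient, frame length, hop, and the synthetic test tone are typical assumed values, not parameters stated in the patent.

```python
import math

# Front-end sketch: pre-emphasis, framing (10-30 ms frames), Hamming window.

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    # pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high frequencies
    emphasized = [signal[0]] + [signal[n] - alpha * signal[n - 1]
                                for n in range(1, len(signal))]
    # Hamming window tapers frame ends to reduce boundary artifacts
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# 1 s of a 200 Hz tone "sampled" at 16 kHz stands in for recorded speech
sig = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(16000)]
frames = preprocess(sig)   # 400-sample (25 ms) frames at a 10 ms hop
print(len(frames), len(frames[0]))
```

Each windowed frame would then feed the MFCC computation on the client server.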
  • in this embodiment, the function of extracting the voiceprint feature vector is moved forward onto the client server: the client collects the voice signal by recording, the local client server directly extracts the voiceprint feature vector of the voice signal, and the vector is then transmitted to the third-party verification server for voiceprint verification, voiceprint verification model training, and speaker recognition.
  • because the voiceprint feature vector cannot be reversed to restore the original voice signal, the customer's recorded voice signal stays confidential, which improves data security and the security of the customer identity authentication process.
  • only the data obtained after extracting the voiceprint feature vector is transmitted to the server for voiceprint verification; since this data is far smaller than the original voice signal, transmission efficiency is greatly increased.
  • each voiceprint feature vector is mapped to a low-dimensional voiceprint discrimination vector i-vector, which reduces the computation cost and the usage cost of voiceprint verification.
  • during verification, comparison against the pre-stored data of multiple people lowers the equal error rate of voiceprint verification and reduces the influence of its model errors.
  • when the judgment result is that the first voiceprint feature is not the same as the pre-stored voiceprint feature,
  • the process by which the client server controls the client's feedback response according to the judgment result includes:
  • the client server generates feedback information indicating unsuccessful authentication and sends it to the client;
  • the client is controlled into a disabled state and an alarm is issued.
  • the voiceprint verification system includes an alarm and a safety control device to enhance the functional completeness of the voiceprint verification system in the actual application process and improve management security and information security.
  • an embodiment of the present application further provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 2.
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus, where the processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store data such as voiceprint verification data.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • FIG. 2 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the processes of the foregoing method embodiments are performed.
  • the above are only preferred embodiments of the present application and do not limit its patent scope; any equivalent structure or equivalent process transformation made using the description and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Collating Specific Patterns (AREA)

Abstract

Disclosed is a voiceprint verification method. The method comprises: a client server extracting a voice signal whose identity is to be verified, extracting the corresponding MFCC voiceprint features, and constructing from them a first voiceprint feature composed of the voiceprint feature vectors of each frame of voice data; a voiceprint verification server receiving the first voiceprint feature; the voiceprint verification server determining whether a feature distance value between the voiceprint identification vectors (i-vectors) respectively corresponding to the first voiceprint feature and a pre-stored voiceprint feature meets a preset requirement; and if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature.

Description

声纹验证的方法、装置、计算机设备以及存储介质Voiceprint verification method, device, computer equipment and storage medium
本申请要求于2018年10月11日提交中国专利局、申请号为2018111847753,发明名称为“声纹验证的方法、装置、计算机设备以及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application requires the priority of the Chinese patent application filed on October 11, 2018 in the Chinese Patent Office with the application number 2018111847753 and the invention titled "Method, Device, Computer Equipment, and Storage Media for Voiceprint Verification", all of which are approved by The reference is incorporated in this application.
技术领域Technical field
本申请涉及到声纹验证领域,特别是涉及到声纹验证的方法、装置、计算机设备以及存储介质。The present application relates to the field of voiceprint verification, in particular to a method, device, computer equipment and storage medium for voiceprint verification.
背景技术Background technique
目前,很多大型金融公司的业务范围涉及保险、银行、投资等多个业务范畴,而每个业务范畴通常都需要同客户进行沟通,且都需要进行反欺诈识别,因此,对客户的身份验证及反欺诈识别也就成为保证业务安全的重要组成部分。在客户身份验证环节中,声纹验证由于其具有的实时性和方便性而被许多公司采用。客户声纹模型的训练和客户身份的验证需要采集客户的语音数据,而语音数据的获得往往来源于金融公司与客户的谈话录音。发明人意识到,由于商业洽谈往往涉及机密内容,将语音数据由网络传输到后台再进行语音特征参数的提取不利于数据保密性。At present, the business scope of many large financial companies involves multiple business areas such as insurance, banking, and investment. Each business area usually needs to communicate with customers and anti-fraud identification is required. Therefore, the identity verification and Anti-fraud identification has become an important part of ensuring business security. In the process of customer identity verification, voiceprint verification is adopted by many companies due to its real-time and convenience. The training of the customer voiceprint model and the verification of the customer's identity need to collect the customer's voice data, and the acquisition of the voice data often comes from the recording of the conversation between the financial company and the customer. The inventor realized that since business negotiations often involve confidential content, transferring voice data from the network to the background and then extracting voice feature parameters is not conducive to data confidentiality.
技术问题technical problem
本申请的主要目的为提供声纹验证的方法,旨在解决现有声纹验证过程中需将客户端采集的语音数据发送至后台进行声纹特征提取,导致传输中语音数据的保密性较差的技术问题。The main purpose of this application is to provide a method of voiceprint verification, which aims to solve the problem that the voice data collected by the client needs to be sent to the background for voiceprint feature extraction in the existing voiceprint verification process, resulting in poor confidentiality of the voice data in transmission technical problem.
技术解决方案Technical solution
本申请提出一种声纹验证的方法,包括:This application proposes a method for voiceprint verification, including:
通过客户端服务器提取待验证身份的语音信号,并提取所述语音信号中各帧语音数据分别对应的MFCC类型声纹特征;Extract the voice signal of the identity to be verified through the client server, and extract the MFCC type voiceprint features corresponding to each frame of voice data in the voice signal;
通过所述客户端服务器将所述MFCC类型声纹特征构建成各帧语音数据分别对应的声纹特征向量,以形成第一声纹特征;Constructing the MFCC type voiceprint feature into voiceprint feature vectors corresponding to each frame of voice data through the client server to form a first voiceprint feature;
声纹验证服务器接收所述客户端服务器发送的所述第一声纹特征;The voiceprint verification server receives the first voiceprint feature sent by the client server;
声纹验证服务器判断所述第一声纹特征与预存声纹特征分别对应的声纹鉴别向量i-vector之间的特征距离值是否满足预设要求;The voiceprint verification server judges whether the feature distance value between the voiceprint discrimination vector i-vector corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirements;
若满足,则判定所述第一声纹特征与所述预存声纹特征相同,否则不相同。If satisfied, it is determined that the first voiceprint feature is the same as the pre-stored voiceprint feature, otherwise it is not the same.
本申请还提供了一种声纹验证系统,包括客户端、客户端服务器和声纹验证服务器;This application also provides a voiceprint verification system, including a client, a client server, and a voiceprint verification server;
所述客户端采集待验证身份的语音信号,并将所述语音信号发送到所述客户端服务器;The client collects the voice signal of the identity to be verified, and sends the voice signal to the client server;
所述客户端服务器接收所述语音信号,并对所述语音信号进行声纹特征提取得到第一声纹特征,将第一声纹特征传输至声纹验证服务器;The client server receives the voice signal, extracts voiceprint features from the voice signal to obtain a first voiceprint feature, and transmits the first voiceprint feature to the voiceprint verification server;
所述声纹验证服务器接收所述第一声纹特征,并将所述第一声纹特征与预存声纹特征进行比较分析,以判断所述第一声纹特征与所述预存声纹特征是否相同,并将判断结果反馈至所述客户端服务器;The voiceprint verification server receives the first voiceprint feature, and compares the first voiceprint feature with a pre-stored voiceprint feature to determine whether the first voiceprint feature and the pre-stored voiceprint feature The same, and feedback the judgment result to the client server;
所述客户端服务器根据所述判断结果控制所述客户端进行反馈响应。The client server controls the client to perform a feedback response according to the judgment result.
本申请还提供了一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现上述方法的步骤。The present application also provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the steps of the foregoing method are implemented.
本申请还提供了一种计算机非易失性可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现上述的方法的步骤。The present application also provides a computer non-volatile readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the above method.
有益效果Beneficial effect
本申请将声纹特征向量提取的功能前置到客户端服务器上完成，客户端通过录音采集语音信号后直接在本地的客户端服务器提取语音信号的声纹特征向量，然后再将声纹特征向量传输至第三方技术支持的验证服务器上进行声纹验证，声纹验证模型的训练和说话人辨认过程，由于声纹特征向量无法再反推还原为语音信号的原始数据，有利于对客户录音的语音信号进行数据保密，提高数据安全性，使客户身份认证流程的安全性得到了提高。本申请通过提取声纹特征向量后的数据传输至服务器进行声纹验证，声纹特征向量数据比原始语音信号数据更为轻便，大大增加了传输效率。本申请基于GMM-UBM实现将各所述声纹特征向量分别映射为低维度的声纹鉴别向量i-vector，降低计算成本，降低声纹验证的使用成本。在验证过程中通过与多人的预存数据进行比较分析，降低声纹验证的等错率，降低声纹验证的模型误差带来的影响。In this application, the function of extracting the voiceprint feature vector is moved forward onto the client server: after the client collects the voice signal by recording, the voiceprint feature vector is extracted directly on the local client server, and only then is it transmitted to a verification server operated with third-party technical support for voiceprint verification, voiceprint verification model training, and speaker recognition. Because the voiceprint feature vector cannot be inverted to recover the original voice signal data, this keeps the customer's recorded voice signal confidential, improves data security, and makes the customer identity authentication process more secure. Since only the data obtained after voiceprint feature vector extraction is transmitted to the server for voiceprint verification, and voiceprint feature vector data is far lighter than raw voice signal data, transmission efficiency is greatly increased. Based on GMM-UBM, this application maps each voiceprint feature vector to a low-dimensional voiceprint discrimination vector i-vector, reducing computation cost and the cost of using voiceprint verification. During verification, comparison against the pre-stored data of multiple people lowers the equal error rate of voiceprint verification and mitigates the impact of model error.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1 本申请一实施例的声纹验证的方法流程示意图;FIG. 1 is a schematic flowchart of a method for voiceprint verification according to an embodiment of the present application;
图2本申请一实施例的计算机设备内部结构示意图。FIG. 2 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
本发明的最佳实施方式Best Mode of the Invention
参照图1,本申请一实施例的声纹验证的方法,通过客户端采集信息,通过服务器进行声纹验证,方法包括:Referring to FIG. 1, a method of voiceprint verification according to an embodiment of the present application collects information through a client and performs voiceprint verification through a server. The method includes:
S1:通过客户端服务器提取待验证身份的语音信号,并提取所述语音信号中各帧语音数据分别对应的MFCC类型声纹特征。S1: Extract the voice signal of the identity to be verified through the client server, and extract the MFCC type voiceprint features corresponding to each frame of voice data in the voice signal.
本实施例的MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)类型声纹特征具有非线性特征，使客户的语音信号在各频段上的分析结果更贴近人体发出的真实语音的特征，提高声纹验证的效果。The MFCC (Mel Frequency Cepstrum Coefficient) type voiceprint feature of this embodiment has a non-linear characteristic, so that the analysis of the customer's voice signal in each frequency band more closely matches the characteristics of real speech produced by the human body, improving the effect of voiceprint verification.
S2:通过所述客户端服务器将所述MFCC类型声纹特征构建成各帧语音数据分别对应的声纹特征向量,以形成第一声纹特征。S2: The client server is used to construct the MFCC-type voiceprint feature into voiceprint feature vectors corresponding to each frame of voice data to form a first voiceprint feature.
本实施例根据提取的MFCC类型声纹特征构建各帧语音数据分别对应的声纹特征向量，然后通过语音信号的各帧语音数据的排序，将分别对应的MFCC类型声纹特征组合在一起，得到客户的语音信号对应的第一声纹特征，上述构建过程依然在客户端服务器完成，以增强数据传输过程中的数据保密性。In this embodiment, the voiceprint feature vector corresponding to each frame of voice data is constructed from the extracted MFCC type voiceprint features; the per-frame features are then combined according to the order of the frames in the voice signal to obtain the first voiceprint feature corresponding to the customer's voice signal. This construction process is still completed on the client server, to enhance data confidentiality during data transmission.
S3:声纹验证服务器接收所述客户端服务器发送的所述第一声纹特征。S3: The voiceprint verification server receives the first voiceprint feature sent by the client server.
本实施例将第一声纹特征的提取工作前置到客户端服务器完成，以便客户端服务器接收录音采集的客户的语音信号后，直接在客户端服务器提取语音信号对应的第一声纹特征，然后再传输至第三方技术支持的声纹验证服务器进行声纹验证。由于第一声纹特征无法再通过反推还原为原始的语音信号，有利于对客户录音的语音信号进行数据保密，提高数据安全性，使客户身份认证流程的安全性得到了提高，同时，第一声纹特征比语音信号数据量更小，大大增加了传输效率。通过客户端服务器对采集的语音信号提取声纹特征，将提取后的声纹特征传输至声纹验证服务器进行声纹验证，使声纹特征提取的客户端服务器和声纹验证服务器进行分离。In this embodiment, extraction of the first voiceprint feature is moved forward to the client server, so that after the client server receives the customer's recorded voice signal, the first voiceprint feature corresponding to the voice signal is extracted directly on the client server and only then transmitted to the voiceprint verification server operated with third-party technical support. Because the first voiceprint feature cannot be inverted to recover the original voice signal, this keeps the customer's recorded voice signal confidential, improves data security, and makes the customer identity authentication process more secure; at the same time, the first voiceprint feature has a much smaller data volume than the voice signal, greatly increasing transmission efficiency. Extracting the voiceprint feature from the collected voice signal on the client server and transmitting the extracted feature to the voiceprint verification server separates the feature-extraction client server from the voiceprint verification server.
S4:声纹验证服务器判断所述第一声纹特征与预存声纹特征分别对应的声纹鉴别向量i-vector之间的特征距离值是否满足预设要求。S4: The voiceprint verification server determines whether the feature distance value between the voiceprint identification vector i-vector corresponding to the first voiceprint feature and the pre-stored voiceprint feature respectively meets the preset requirements.
本实施例的预设要求包括特征距离值达到指定的预设阈值范围等,可根据具体的应用场景进行自定义设定,以更广泛地满足个性化使用需求。The preset requirements in this embodiment include that the characteristic distance value reaches a specified preset threshold range, etc., and can be customized according to specific application scenarios to meet the personalized usage requirements more widely.
S5:若满足,则判定第一声纹特征与预存声纹特征相同,否则不相同。S5: If satisfied, it is determined that the first voiceprint feature is the same as the pre-stored voiceprint feature, otherwise it is not the same.
本实施例将判定所述第一声纹特征与所述预存声纹特征相同，则通过服务器向客户端反馈验证通过的结果到客户端，否则，反馈验证失败的结果到客户端，以便客户端根据反馈结果进行进一步的应用操作。举例地，验证通过后控制智能门打开等。再举例地，验证失败指定次数后控制安全系统进行锁屏，以防犯罪分子进一步破坏电子银行系统。In this embodiment, if it is determined that the first voiceprint feature is the same as the pre-stored voiceprint feature, the server feeds a verification-passed result back to the client; otherwise, it feeds back a verification-failed result, so that the client can perform further application operations based on the feedback. For example, a smart door is opened after verification passes. As another example, after verification fails a specified number of times, the security system locks the screen to prevent criminals from further attacking the electronic banking system.
进一步地,本实施例的步骤S4,包括:Further, step S4 of this embodiment includes:
S41:将各帧语音数据分别对应的声纹特征向量分别映射为低维度的声纹鉴别向量i-vector。S41: Map voiceprint feature vectors corresponding to each frame of speech data to low-dimensional voiceprint identification vectors i-vector, respectively.
本实施例基于GMM-UBM(Gaussian Mixture Model-Universal Background Model，高斯混合模型-通用背景模型)实现将各帧语音数据分别对应的声纹特征向量分别映射为低维度的声纹鉴别向量i-vector。本实施例的GMM-UBM的训练过程如下：B1：获取预设数量(例如，10万个)的语音数据样本，每个语音数据样本对应一个声纹鉴别向量，每个语音样本可以采集自不同的人在不同环境中的语音，这样的语音数据样本用来训练能够表征一般语音特性的通用背景模型(GMM-UBM)；B2、分别对各个语音数据样本进行处理以提取出各个语音数据样本对应的预设类型声纹特征，并基于各个语音数据样本对应的预设类型声纹特征构建各个语音数据样本对应的声纹特征向量；B3、将构建出的所有预设类型声纹特征向量分为第一百分比的训练集和第二百分比的验证集，所述第一百分比和第二百分比之和小于或等于100%；B4、利用训练集中的声纹特征向量对所述第二模型进行训练，并在训练完成之后利用验证集对训练的所述第二模型的准确率进行验证；B5、若准确率大于预设准确率(例如，98.5%)，则模型训练结束，否则，增加语音数据样本的数量，并基于增加后的语音数据样本重新执行上述步骤B2、B3、B4、B5。This embodiment maps the voiceprint feature vector corresponding to each frame of voice data to a low-dimensional voiceprint discrimination vector i-vector based on GMM-UBM (Gaussian Mixture Model-Universal Background Model). The training process of the GMM-UBM in this embodiment is as follows. B1: obtain a preset number (for example, 100,000) of voice data samples, each corresponding to one voiceprint discrimination vector; the samples may be collected from different people in different environments and are used to train a universal background model (GMM-UBM) that characterizes general speech properties. B2: process each voice data sample to extract its preset-type voiceprint features, and construct the voiceprint feature vector of each sample from those features. B3: divide all the constructed preset-type voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first and second percentages being less than or equal to 100%. B4: train the second model with the voiceprint feature vectors in the training set, and after training, verify the accuracy of the trained second model with the validation set. B5: if the accuracy is greater than a preset accuracy (for example, 98.5%), the model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 with the enlarged sample set.
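The B1-B5 loop above can be sketched as follows; this is a minimal illustration of the split-train-validate-retrain control flow only, and every callable name (`extract`, `train`, `evaluate`, `get_more`) is a hypothetical placeholder, not an API from the patent:

```python
import random

def split_train_validation(feature_vectors, train_pct=0.8, val_pct=0.2):
    """B3: split voiceprint feature vectors into a training set and a
    validation set; the two percentages sum to at most 100%."""
    assert train_pct + val_pct <= 1.0
    shuffled = feature_vectors[:]
    random.shuffle(shuffled)
    n_train = int(len(shuffled) * train_pct)
    n_val = int(len(shuffled) * val_pct)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]

def train_until_accurate(samples, extract, train, evaluate, get_more,
                         target_acc=0.985):
    """B2-B5: extract features, split, train, validate; if accuracy is
    not above the preset threshold, enlarge the sample set and repeat."""
    while True:
        features = [extract(s) for s in samples]               # B2
        train_set, val_set = split_train_validation(features)  # B3
        model = train(train_set)                               # B4
        if evaluate(model, val_set) > target_acc:              # B5
            return model
        samples = samples + get_more()
```

The 0.8/0.2 split and the 0.985 accuracy threshold mirror the example values in the text; both are configurable parameters rather than fixed constants.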
本实施例的声纹鉴别向量采用声纹鉴别向量i-vector表达，声纹鉴别向量i-vector是一个向量，相对于高斯空间的维度来讲，声纹鉴别向量i-vector维度更低，便于降低计算成本。The voiceprint discrimination vector of this embodiment is expressed as the i-vector. Compared with the dimensionality of the Gaussian space, the i-vector has a much lower dimensionality, which helps reduce computation cost.
S42：通过余弦距离公式 cos(x, y) = (x · y) / (‖x‖ · ‖y‖)，计算第一声纹特征对应的声纹鉴别向量i-vector与预存声纹特征对应的声纹鉴别向量i-vector之间的余弦距离值，其中，x代表预存声纹特征对应的声纹鉴别向量i-vector，y代表第一声纹特征对应的声纹鉴别向量i-vector。S42: Using the cosine distance formula cos(x, y) = (x · y) / (‖x‖ · ‖y‖), calculate the cosine distance value between the voiceprint discrimination vector i-vector corresponding to the first voiceprint feature and the voiceprint discrimination vector i-vector corresponding to the pre-stored voiceprint feature, where x represents the voiceprint discrimination vector i-vector corresponding to the pre-stored voiceprint feature and y represents the voiceprint discrimination vector i-vector corresponding to the first voiceprint feature.
S43:判断所述余弦距离值是否满足预设条件。S43: Determine whether the cosine distance value meets a preset condition.
本实施例的预设条件包括余弦距离值在指定的阈值范围内等，可根据需要设定。本实施例通过将预存的多个人的声纹特征数据中各自对应的预存声纹特征与所述第一声纹特征分别计算的第一余弦距离值进行从小到大排序，判断预设排序在前的几个第一余弦距离值中是否包括目标人的预存声纹特征对应的第一余弦距离值，若包括则判定余弦距离值满足预设条件。本申请另一实施例通过判断目标人的预存声纹特征与所述第一声纹特征之间的第二余弦距离值是否小于或等于预设阈值，若小于或等于，则判定余弦距离值满足预设条件。The preset conditions in this embodiment include the cosine distance value being within a specified threshold range, and can be set as needed. In this embodiment, the first cosine distance values calculated between the first voiceprint feature and each of the pre-stored voiceprint features of multiple people are sorted from smallest to largest, and it is judged whether the preset number of smallest first cosine distance values includes the first cosine distance value corresponding to the target person's pre-stored voiceprint feature; if so, the cosine distance value is determined to satisfy the preset condition. In another embodiment of this application, it is judged whether the second cosine distance value between the target person's pre-stored voiceprint feature and the first voiceprint feature is less than or equal to a preset threshold; if so, the cosine distance value is determined to satisfy the preset condition.
S44：若所述余弦距离值满足预设条件，则判定所述第一声纹特征与预存声纹特征分别对应的声纹鉴别向量i-vector之间的特征距离值满足预设要求，否则不满足预设要求。S44: If the cosine distance value satisfies the preset condition, it is determined that the feature distance value between the voiceprint discrimination vectors i-vector respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement; otherwise, it does not meet the preset requirement.
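The S42 comparison can be sketched with plain Python lists standing in for i-vectors. The patent's figure placeholder points to the standard cosine formula; since the text later treats smaller distance values as more similar, this sketch also derives a distance as 1 − cos(x, y) under that convention, which is an assumption of the sketch rather than a formula stated in the patent:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||) between two i-vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    """Distance convention in which smaller values mean the two
    voiceprint features are closer, matching the ranking in S431."""
    return 1.0 - cosine_similarity(x, y)
```

Identical i-vectors give cos(x, y) = 1 and a distance of 0; orthogonal ones give cos(x, y) = 0.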
进一步地,本实施例的步骤S41,包括:Further, step S41 of this embodiment includes:
S410:将提取得到的各帧语音数据分别对应的声纹特征向量分别输入到GMM-UBM模型,得到表征各帧语音数据在各高斯分量上的概率分布的高斯超向量。S410: Input voiceprint feature vectors corresponding to each frame of extracted speech data to the GMM-UBM model, respectively, to obtain a Gaussian supervector representing the probability distribution of each frame of speech data on each Gaussian component.
S411：将各所述高斯超向量利用公式 M = μ + Tω 计算得到各帧语音数据分别对应的低维度的声纹鉴别向量i-vector，其中，M为各帧语音数据的高斯超向量，μ为所述GMM-UBM模型的均值超向量，ω为各帧语音数据的低维度的声纹鉴别向量i-vector，T为映射到高维度的高斯空间的转换矩阵。S411: Using the formula M = μ + Tω, compute from each Gaussian supervector the low-dimensional voiceprint discrimination vector i-vector corresponding to each frame of voice data, where M is the Gaussian supervector of each frame of voice data, μ is the mean supervector of the GMM-UBM model, ω is the low-dimensional voiceprint discrimination vector i-vector of each frame of voice data, and T is the transformation matrix that maps into the high-dimensional Gaussian space.
本实施例的T训练采用EM算法。EM算法，指的是最大期望算法(Expectation Maximization Algorithm，又译期望最大化算法)，是一种迭代算法，在统计学中被用于寻找依赖于不可观察的隐性变量的概率模型中参数的最大似然估计。最大期望算法经过两个步骤交替进行计算：1)计算期望(E)，利用概率模型参数的现有估计值，计算隐藏变量的期望；2)最大化(M)，利用E步上求得的隐藏变量的期望，对参数模型进行最大似然估计。上步找到的参数估计值被用于下步计算中，不断交替进行。The training of T in this embodiment uses the EM algorithm. The EM algorithm (Expectation-Maximization algorithm) is an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probability models that depend on unobservable latent variables. It alternates between two steps: 1) the expectation step (E), which computes the expectation of the hidden variables using the current estimates of the model parameters; 2) the maximization step (M), which performs maximum likelihood estimation of the model parameters using the hidden-variable expectations obtained in the E step. The parameter estimates found in one step are used in the next, and the two steps alternate until convergence.
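Given a trained transformation matrix T, the mapping in S411 can be sketched as solving M = μ + Tω for ω. This is a deliberate simplification for illustration: production i-vector extractors compute ω from Baum-Welch posterior statistics rather than a direct least-squares solve, and the toy dimensions below are illustrative:

```python
import numpy as np

def extract_ivector(M, mu, T):
    """Solve M = mu + T @ w for the low-dimensional i-vector w.

    M  : Gaussian supervector of one frame (high-dimensional)
    mu : mean supervector of the GMM-UBM
    T  : transformation matrix mapping w into the Gaussian space
    """
    w, *_ = np.linalg.lstsq(T, M - mu, rcond=None)
    return w

# Toy example: a 6-dimensional supervector space, 2-dimensional i-vectors.
rng = np.random.default_rng(0)
T = rng.standard_normal((6, 2))
mu = rng.standard_normal(6)
w_true = np.array([0.5, -1.2])
M = mu + T @ w_true          # synthesize a supervector from a known w
w_est = extract_ivector(M, mu, T)
```

Because M is built exactly as μ + Tω here, the least-squares solve recovers ω; with real supervectors the solution is only an approximation.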
进一步地,本实施例的步骤S43,包括:Further, step S43 of this embodiment includes:
S430：分别获取预存的多个人的声纹特征数据中各自对应的预存声纹特征与所述第一声纹特征之间的第一余弦距离值，其中，多个人的声纹特征数据中包括目标人的预存声纹特征。S430: Obtain the first cosine distance values between the first voiceprint feature and each of the pre-stored voiceprint features in the pre-stored voiceprint feature data of multiple people, where the voiceprint feature data of the multiple people includes the pre-stored voiceprint feature of the target person.
本实施例通过将预存的包括目标人的多人的声纹特征数据，同时用于判断当前采集的语音信号的声纹特征是否与目标人的声纹特征相同，以提高判断准确性。本实施例通过余弦距离公式 cos(x, y) = (x · y) / (‖x‖ · ‖y‖) 表示各所述预存声纹特征与所述第一声纹特征之间的第一余弦距离值，其中，x代表各预存声纹鉴别向量，y代表第一声纹特征的声纹鉴别向量i-vector，余弦距离值越小，表明两声纹特征更接近或相同。本实施例的"第一"，仅用作区别，不用于限定，其他处的作用相同，不赘述。In this embodiment, the pre-stored voiceprint feature data of multiple people, including the target person, is used to judge whether the voiceprint feature of the currently collected voice signal is the same as the target person's voiceprint feature, so as to improve judgment accuracy. The cosine distance formula cos(x, y) = (x · y) / (‖x‖ · ‖y‖) expresses the first cosine distance value between each pre-stored voiceprint feature and the first voiceprint feature, where x represents each pre-stored voiceprint discrimination vector and y represents the voiceprint discrimination vector i-vector of the first voiceprint feature; the smaller the cosine distance value, the closer or more similar the two voiceprint features. The term "first" in this embodiment is used only for distinction, not limitation; it has the same role elsewhere and is not repeated.
S431:将各所述第一余弦距离值按照从小到大的顺序进行排序。S431: Sort the first cosine distance values in ascending order.
本实施例通过将各所述预存声纹特征与所述第一声纹特征之间的第一余弦距离值进行从小到大排序，以便更准确地分析第一声纹特征与各预存声纹特征的相似度分布状态，以便更准确地获得对第一声纹特征的验证。In this embodiment, the first cosine distance values between the first voiceprint feature and each pre-stored voiceprint feature are sorted from smallest to largest, so that the similarity distribution between the first voiceprint feature and the pre-stored voiceprint features can be analyzed more accurately, allowing more accurate verification of the first voiceprint feature.
S432：判断排序在前的预设数量的第一余弦距离值中，是否包括所述目标人的预存声纹特征对应的第一余弦距离值。S432: Determine whether the preset number of smallest first cosine distance values includes the first cosine distance value corresponding to the target person's pre-stored voiceprint feature.
本实施例通过排序在前的预设数量的第一余弦距离值中包括所述目标人的预存声纹特征对应的第一余弦距离值，则判定第一声纹特征与预存的目标人的声纹特征相同，以减小模型误差带来的识别等错率，上述等错率为"应验证通过时发生的验证未通过的频率，与应验证未通过时发生的验证通过的频率相等"。本实施例的预设数量的第一余弦距离值包括1个、2个或3个等，可根据使用需求进行自设定。In this embodiment, if the preset number of smallest first cosine distance values includes the first cosine distance value corresponding to the target person's pre-stored voiceprint feature, the first voiceprint feature is determined to be the same as the target person's pre-stored voiceprint feature, which reduces the equal error rate caused by model error. The equal error rate is the operating point at which the frequency of verifications that fail when they should pass equals the frequency of verifications that pass when they should fail. The preset number of first cosine distance values in this embodiment may be 1, 2, 3 and so on, and can be set according to usage requirements.
S433:若是,则判定余弦距离值满足预设条件,否则不满足预设条件。S433: If yes, it is determined that the cosine distance value meets the preset condition, otherwise, the preset condition is not met.
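Steps S430-S433 reduce to ranking the first cosine distance values and checking whether the target person appears among the smallest few. In this sketch, `distances` maps person identifiers to already-computed first cosine distance values, and the identifiers and values are illustrative:

```python
def target_in_top(distances, target_id, top_n=3):
    """S431/S432: sort first cosine distance values from smallest to
    largest and check whether the target person's pre-stored voiceprint
    ranks within the first top_n entries."""
    ranked = sorted(distances, key=distances.get)  # smallest distance first
    return target_id in ranked[:top_n]

# Example: the target's pre-stored voiceprint is the 2nd closest match,
# so with top_n=3 the preset condition is satisfied (S433).
distances = {"alice": 0.31, "target": 0.12, "bob": 0.08, "carol": 0.77}
accepted = target_in_top(distances, "target", top_n=3)
```

`top_n` corresponds to the preset number (1, 2, 3, ...) mentioned above.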
进一步地,本申请另一实施例的步骤S43,包括:Further, step S43 of another embodiment of the present application includes:
S434:获取目标人的预存声纹特征与第一声纹特征间的第二余弦距离值。S434: Obtain a second cosine distance value between the pre-stored voiceprint feature of the target person and the first voiceprint feature.
本实施例通过只针对性地比较一个第二余弦距离值，减小比较计算量，提高验证速率。In this embodiment, by comparing only a single second cosine distance value in a targeted manner, the amount of comparison computation is reduced and the verification speed is improved.
S435:判断所述第二余弦距离值是否小于或等于预设阈值。S435: Determine whether the second cosine distance value is less than or equal to a preset threshold.
本实施例通过设定第一声纹特征与目标用户的预存声纹特征的距离阈值,实现有效的声纹验证。举例地,预设阈值为0.6。In this embodiment, by setting a distance threshold between the first voiceprint feature and the pre-stored voiceprint feature of the target user, effective voiceprint verification is achieved. For example, the preset threshold is 0.6.
S436:若是,则判定余弦距离值满足预设条件,否则不满足预设条件。S436: If yes, it is determined that the cosine distance value meets the preset condition, otherwise, the preset condition is not met.
本实施例计算第一声纹特征与目标用户的预存声纹特征的余弦距离小于或等于预设阈值，则判定余弦距离值满足预设条件，确定第一声纹特征与目标用户的预存声纹特征相同，则验证通过；若计算第一声纹特征与目标用户的预存声纹特征的余弦距离大于预设阈值，则判定所述距离值不满足预设条件，确定第一声纹特征与目标用户的预存声纹特征不相同，则验证失败。In this embodiment, if the computed cosine distance between the first voiceprint feature and the target user's pre-stored voiceprint feature is less than or equal to the preset threshold, the cosine distance value is determined to satisfy the preset condition, the first voiceprint feature is determined to be the same as the target user's pre-stored voiceprint feature, and the verification passes; if the cosine distance is greater than the preset threshold, the distance value is determined not to satisfy the preset condition, the first voiceprint feature is determined to differ from the target user's pre-stored voiceprint feature, and the verification fails.
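The threshold embodiment of S434-S436 reduces to a single comparison; 0.6 is the example threshold given above, not a fixed value:

```python
def verify_against_target(second_cosine_distance, threshold=0.6):
    """S435/S436: verification passes when the second cosine distance
    between the target person's pre-stored voiceprint feature and the
    first voiceprint feature is less than or equal to the preset
    threshold; otherwise it fails."""
    return second_cosine_distance <= threshold
```

A distance exactly equal to the threshold passes, matching the "less than or equal to" wording.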
本申请还提供了一种声纹验证系统,包括客户端、客户端服务器和声纹验证服务器;This application also provides a voiceprint verification system, including a client, a client server, and a voiceprint verification server;
客户端采集待验证身份的语音信号,并将语音信号发送到客户端服务器;The client collects the voice signal of the identity to be verified and sends the voice signal to the client server;
所述客户端服务器接收所述语音信号,并对所述语音信号进行声纹特征提取得到第一声纹特征,将第一声纹特征传输至声纹验证服务器;The client server receives the voice signal, extracts voiceprint features from the voice signal to obtain a first voiceprint feature, and transmits the first voiceprint feature to the voiceprint verification server;
所述声纹验证服务器接收所述第一声纹特征，并将所述第一声纹特征与预存声纹特征进行比较分析，以判断所述第一声纹特征与所述预存声纹特征是否相同，并将判断结果反馈至所述客户端服务器；The voiceprint verification server receives the first voiceprint feature, compares and analyzes it against a pre-stored voiceprint feature to determine whether the first voiceprint feature is the same as the pre-stored voiceprint feature, and feeds the judgment result back to the client server;
所述客户端服务器根据所述判断结果控制所述客户端进行反馈响应。The client server controls the client to perform a feedback response according to the judgment result.
进一步地，本实施例的所述语音信号的连续模拟信号通过客户端按照指定采样周期进行采样，以形成离散模拟信号，并按指定编码规则量化为数字信号；所述客户端服务器接收所述语音信号，并对所述语音信号进行声纹特征提取得到第一声纹特征的过程如下：Further, in this embodiment, the continuous analog voice signal is sampled by the client at a specified sampling period to form a discrete analog signal, which is then quantized into a digital signal according to a specified encoding rule. The process by which the client server receives the voice signal and extracts voiceprint features to obtain the first voiceprint feature is as follows:
S101，所述客户端服务器将所述数字信号进行预加重后，对预加重的数字信号进行分帧处理，得到各帧语音数据；S102，根据公式 f_mel = 2595 × log10(1 + f / 700) 将各帧语音数据从线性频谱域映射到梅尔频谱域，其中，f_mel表示梅尔频谱值，f表示线性频谱值；S103，将转化为梅尔频谱域的各帧语音数据输入到一组梅尔三角滤波器组，计算每个频段的梅尔三角滤波器输出的对数能量，得到各帧语音数据分别对应的对数能量序列；S104，将各所述对数能量序列进行离散余弦变换，得到各帧语音数据分别对应的MFCC类型声纹特征；将所述MFCC类型声纹特征构建成各帧语音数据分别对应的声纹特征向量，以形成所述第一声纹特征。S101: After pre-emphasizing the digital signal, the client server performs framing on the pre-emphasized digital signal to obtain each frame of voice data. S102: Map each frame of voice data from the linear spectrum domain to the mel spectrum domain according to f_mel = 2595 × log10(1 + f / 700), where f_mel represents the mel spectrum value and f represents the linear spectrum value. S103: Input each frame of voice data converted to the mel spectrum domain into a set of mel triangular filter banks, and compute the log energy output by the mel triangular filter of each frequency band to obtain the log energy sequence corresponding to each frame of voice data. S104: Apply a discrete cosine transform to each log energy sequence to obtain the MFCC type voiceprint features corresponding to each frame of voice data; construct the MFCC type voiceprint features into voiceprint feature vectors corresponding to each frame of voice data to form the first voiceprint feature.
上述预加重，由于人体的生理特性，语音信号的高频成分往往被压抑，预加重的作用是补偿高频成分；上述分帧处理中，由于语音信号的"瞬时平稳性"，在进行频谱分析时对一段话音信号进行分帧处理(一般为10至30毫秒一帧)，然后以帧为单位进行特征提取；上述分帧处理后进行了加窗处理，作用是减少帧起始和结束地方信号的不连续性问题，本实施例采用汉明窗进行加窗处理。Regarding the above pre-emphasis: due to the physiological characteristics of the human body, the high-frequency components of the voice signal are often suppressed, and pre-emphasis compensates for them. In the framing step, because of the short-time stationarity of the voice signal, spectrum analysis is performed by splitting the voice signal into frames (generally 10 to 30 milliseconds per frame) and extracting features frame by frame. After framing, windowing is applied to reduce discontinuities at the beginning and end of each frame; this embodiment uses a Hamming window for windowing.
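The S101-S102 front end described above can be sketched as follows. The frame length (400 samples, i.e. 25 ms at 16 kHz), frame shift (160 samples) and the 0.97 pre-emphasis coefficient are common defaults assumed for illustration, not values fixed by the patent:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """S101: boost the high-frequency components suppressed by human
    physiology via a first-order difference filter."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, frame_shift=160):
    """S101: split into overlapping 10-30 ms frames and apply a
    Hamming window to reduce edge discontinuities."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * frame_shift:i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])

def hz_to_mel(f):
    """S102: map a linear spectrum value to the mel spectrum domain."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# 0.1 s of dummy audio at 16 kHz yields 8 windowed frames.
frames = frame_signal(pre_emphasis(np.ones(1600)))
```

The mel filter banks (S103) and discrete cosine transform (S104) would then operate on the per-frame spectra; those steps are omitted here for brevity.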
本实施例将声纹特征向量提取的功能前置到客户端服务器上完成，客户端通过录音采集语音信号后直接在本地的客户端服务器提取语音信号的声纹特征向量，然后再将声纹特征向量传输至第三方技术支持的验证服务器上进行声纹验证，声纹验证模型的训练和说话人辨认过程，由于声纹特征向量无法再反推还原为语音信号的原始数据，有利于对客户录音的语音信号进行数据保密，提高数据安全性，使客户身份认证流程的安全性得到了提高。本实施例通过提取声纹特征向量后的数据传输至服务器进行声纹验证，声纹特征向量数据比原始语音信号数据更为轻便，大大增加了传输效率。本实施例基于GMM-UBM实现将各所述声纹特征向量分别映射为低维度的声纹鉴别向量i-vector，降低计算成本，降低声纹验证的使用成本。在验证过程中通过与多人的预存数据进行比较分析，降低声纹验证的等错率，降低声纹验证的模型误差带来的影响。In this embodiment, the function of extracting the voiceprint feature vector is moved forward onto the client server: after the client collects the voice signal by recording, the voiceprint feature vector is extracted directly on the local client server, and only then is it transmitted to a verification server operated with third-party technical support for voiceprint verification, voiceprint verification model training, and speaker recognition. Because the voiceprint feature vector cannot be inverted to recover the original voice signal data, this keeps the customer's recorded voice signal confidential, improves data security, and makes the customer identity authentication process more secure. Since only the data obtained after voiceprint feature vector extraction is transmitted to the server for voiceprint verification, and voiceprint feature vector data is far lighter than raw voice signal data, transmission efficiency is greatly increased. Based on GMM-UBM, this embodiment maps each voiceprint feature vector to a low-dimensional voiceprint discrimination vector i-vector, reducing computation cost and the cost of using voiceprint verification. During verification, comparison against the pre-stored data of multiple people lowers the equal error rate of voiceprint verification and mitigates the impact of model error.
进一步地,判断结果包括第一声纹特征与预存声纹特征不相同,所述客户端服务器根据所述判断结果控制所述客户端进行反馈响应的过程,包括:Further, the judgment result includes that the first voiceprint feature is not the same as the pre-stored voiceprint feature, and the client server controlling the feedback response process of the client according to the judgment result includes:
客户端服务器生成身份验证不成功的反馈信息并发送至所述客户端;The client server generates feedback information about unsuccessful authentication and sends it to the client;
判断预设时间内根据所述第一声纹特征生成身份验证不成功的反馈信息的次数,是否超过预设次数。It is determined whether the number of times that the feedback information of unsuccessful identity verification is generated according to the first voiceprint feature within a preset time exceeds a preset number of times.
若超过预设次数,则控制所述客户端处于禁用状态,并发出警报。If the preset times are exceeded, the client is controlled to be in a disabled state and an alarm is issued.
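The lockout logic above can be sketched with a sliding time window. The window length and retry limit are illustrative values; the patent specifies only that both are preset:

```python
import time

class FailureGuard:
    """Disable the client after too many failed verifications of the
    first voiceprint feature within a preset time window."""

    def __init__(self, max_failures=3, window_seconds=300):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.failures = []  # timestamps of unsuccessful verifications

    def record_failure(self, now=None):
        """Register one unsuccessful verification; return True when the
        client should be put in the disabled state and an alarm raised."""
        now = time.time() if now is None else now
        # Keep only failures that fall inside the preset time window.
        self.failures = [t for t in self.failures
                         if now - t <= self.window_seconds]
        self.failures.append(now)
        return len(self.failures) > self.max_failures
```

Failures older than the window are discarded, so only a burst of failed attempts within the preset time triggers the disabled state.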
本声纹验证系统包括警报和安全管控装置,以增强该声纹验证系统在实际应用过程的功能完备性,提高管理安全和信息安全。The voiceprint verification system includes an alarm and a safety control device to enhance the functional completeness of the voiceprint verification system in the actual application process and improve management security and information security.
参照图2，本申请实施例中还提供一种计算机设备，该计算机设备可以是服务器，其内部结构可以如图2所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储声纹验证等数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令在执行时，执行如上述各方法的实施例的流程。本领域技术人员可以理解，图2中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定。Referring to FIG. 2, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 2. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store data such as voiceprint verification data. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed, the computer-readable instructions perform the processes of the method embodiments described above. Those skilled in the art can understand that the structure shown in FIG. 2 is only a block diagram of part of the structure related to the solution of this application, and does not limit the computer device to which the solution is applied.
本申请一实施例还提供一种计算机非易失性可读存储介质，其上存储有计算机可读指令，该计算机可读指令在执行时，执行如上述各方法的实施例的流程。以上所述仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。An embodiment of the present application further provides a computer non-volatile readable storage medium storing computer-readable instructions which, when executed, perform the processes of the method embodiments described above. The above are only preferred embodiments of the present application and do not limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (18)

  1. A voiceprint verification method, characterized by comprising:
    extracting, by a client server, a voice signal of an identity to be verified, and extracting MFCC-type voiceprint features corresponding to each frame of voice data in the voice signal;
    constructing, by the client server, the MFCC-type voiceprint features into voiceprint feature vectors corresponding to each frame of voice data, so as to form a first voiceprint feature;
    receiving, by a voiceprint verification server, the first voiceprint feature sent by the client server;
    determining, by the voiceprint verification server, whether a feature distance value between the voiceprint identification vectors (i-vectors) respectively corresponding to the first voiceprint feature and a pre-stored voiceprint feature meets a preset requirement;
    if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature; otherwise, determining that they are different.
  2. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The voiceprint verification method according to claim 1, characterized in that the step of the voiceprint verification server determining whether the feature distance value between the i-vectors respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement comprises:
    mapping the voiceprint feature vectors corresponding to each frame of voice data to low-dimensional voiceprint identification vectors (i-vectors);
    calculating, by the cosine distance formula cos(x, y) = (x · y) / (|x| · |y|), the cosine distance value cos(x, y) between the i-vector corresponding to the first voiceprint feature and the i-vector corresponding to the pre-stored voiceprint feature, where x denotes the i-vector corresponding to the pre-stored voiceprint feature and y denotes the i-vector corresponding to the first voiceprint feature;
    determining whether the cosine distance value meets a preset condition;
    if so, determining that the feature distance value between the i-vectors respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement; otherwise, determining that it does not.
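The cosine-distance comparison recited above can be sketched in plain Python (a minimal illustration only: the function name and list-based vectors are my own, and real i-vectors would be higher-dimensional):

```python
import math

def cosine_distance(x, y):
    """cos(x, y) = (x . y) / (|x| * |y|): the cosine distance value between
    the pre-stored i-vector x and the first voiceprint feature's i-vector y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

A value of 1.0 indicates identical direction; the claimed "preset condition" would then be evaluated against this score.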
  3. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The voiceprint verification method according to claim 2, characterized in that the step of mapping the voiceprint feature vectors corresponding to each frame of voice data to the low-dimensional i-vectors comprises:
    inputting the extracted voiceprint feature vectors corresponding to each frame of voice data into a GMM-UBM model, to obtain Gaussian supervectors characterizing the probability distribution of each frame of voice data over the Gaussian components;
    calculating, from each Gaussian supervector by the formula M = μ + Tω, the low-dimensional i-vector corresponding to each frame of voice data, where M is the Gaussian supervector of each frame of voice data, μ is the mean supervector of the GMM-UBM model, ω is the low-dimensional i-vector of each frame of voice data, and T is the transformation matrix that maps to the high-dimensional Gaussian space.
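Full i-vector extraction estimates ω from Baum-Welch statistics with a posterior covariance term; as a minimal sketch of just the relation M = μ + Tω recited above, and assuming T and μ are already trained and known, ω can be recovered by least squares (all names and dimensions here are illustrative, not from the application):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 12, 3                          # supervector / i-vector dims (illustrative)
T = rng.standard_normal((D, d))       # transformation matrix (assumed trained)
mu = rng.standard_normal(D)           # mean supervector of the GMM-UBM model
w_true = np.array([0.5, -1.0, 2.0])   # the frame's low-dimensional i-vector
M = mu + T @ w_true                   # Gaussian supervector: M = mu + T w

# Recover the i-vector from the supervector by least squares on M - mu = T w
w_hat, *_ = np.linalg.lstsq(T, M - mu, rcond=None)
```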
  4. The voiceprint verification method according to claim 2, characterized in that the step of determining whether the cosine distance value meets the preset condition comprises:
    obtaining the first cosine distance values between the first voiceprint feature and each of the pre-stored voiceprint features in the pre-stored voiceprint feature data of multiple persons, wherein the voiceprint feature data of the multiple persons includes the pre-stored voiceprint feature of a target person;
    sorting the first cosine distance values in ascending order;
    determining whether the first cosine distance value corresponding to the pre-stored voiceprint feature of the target person is among the first preset number of sorted first cosine distance values;
    if so, determining that the first cosine distance value meets the preset condition; otherwise, determining that it does not.
  5. The voiceprint verification method according to claim 2, characterized in that the step of determining whether the cosine distance value meets the preset condition comprises:
    obtaining a second cosine distance value between the pre-stored voiceprint feature of the target person and the first voiceprint feature;
    determining whether the second cosine distance value is less than or equal to a preset threshold;
    if so, determining that the second cosine distance value meets the preset condition; otherwise, determining that it does not.
  6. A voiceprint verification system, characterized by comprising a client, a client server, and a voiceprint verification server, wherein:
    the client collects a voice signal of an identity to be verified, and sends the voice signal to the client server;
    the client server receives the voice signal, performs voiceprint feature extraction on the voice signal to obtain a first voiceprint feature, and transmits the first voiceprint feature to the voiceprint verification server;
    the voiceprint verification server receives the first voiceprint feature, compares and analyzes the first voiceprint feature against a pre-stored voiceprint feature to determine whether the first voiceprint feature and the pre-stored voiceprint feature are the same, and feeds the determination result back to the client server;
    the client server controls the client to make a feedback response according to the determination result.
  7. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The voiceprint verification system according to claim 6, characterized in that the continuous analog signal of the voice signal is sampled by the client at a specified sampling period to form a discrete analog signal, which is quantized into a digital signal according to a specified encoding rule; and the process of the client server receiving the voice signal and performing voiceprint feature extraction on the voice signal to obtain the first voiceprint feature comprises:
    pre-emphasizing, by the client server, the digital signal, and framing the pre-emphasized digital signal to obtain each frame of voice data;
    mapping each frame of voice data from the linear spectrum domain to the Mel spectrum domain according to mel(f) = 2595 · log10(1 + f/700), where mel(f) denotes the Mel spectrum value and f denotes the linear spectrum value;
    inputting each frame of voice data converted into the Mel spectrum domain into a bank of Mel triangular filters, and computing the logarithmic energy output by the Mel triangular filter of each frequency band, to obtain the logarithmic energy sequence corresponding to each frame of voice data;
    performing a discrete cosine transform on each logarithmic energy sequence, to obtain the MFCC-type voiceprint features corresponding to each frame of voice data;
    constructing the MFCC-type voiceprint features into the voiceprint feature vectors corresponding to each frame of voice data, so as to form the first voiceprint feature.
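The Mel mapping and the discrete cosine transform steps above can be sketched as follows (a simplified illustration: function names are mine, and the pre-emphasis, framing, and triangular filter-bank stages are omitted):

```python
import math

def hz_to_mel(f):
    """Map a linear-spectrum frequency to the Mel spectrum value:
    mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def dct_ii(log_energies, n_coeffs):
    """Type-II discrete cosine transform of the log filter-bank energies,
    yielding the first n_coeffs MFCC-type coefficients (unnormalized)."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (m + 0.5) / n)
                for m, e in enumerate(log_energies))
            for k in range(n_coeffs)]
```

The zeroth DCT coefficient is the sum of the log energies (overall frame energy); higher coefficients capture the spectral envelope shape used as the voiceprint feature.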
  8. The voiceprint verification system according to claim 6, characterized in that the determination result includes that the first voiceprint feature is different from the pre-stored voiceprint feature, and the process of the client server controlling the client to make a feedback response according to the determination result comprises:
    generating, by the client server, feedback information indicating unsuccessful identity verification, and sending it to the client;
    determining whether the number of times that feedback information indicating unsuccessful identity verification has been generated for the first voiceprint feature within a preset time exceeds a preset number of times;
    if so, controlling the client to be in a disabled state, and issuing an alarm.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements a voiceprint verification method, the voiceprint verification method comprising:
    extracting, by a client server, a voice signal of an identity to be verified, and extracting MFCC-type voiceprint features corresponding to each frame of voice data in the voice signal;
    constructing, by the client server, the MFCC-type voiceprint features into voiceprint feature vectors corresponding to each frame of voice data, so as to form a first voiceprint feature;
    receiving, by a voiceprint verification server, the first voiceprint feature sent by the client server;
    determining, by the voiceprint verification server, whether a feature distance value between the voiceprint identification vectors (i-vectors) respectively corresponding to the first voiceprint feature and a pre-stored voiceprint feature meets a preset requirement;
    if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature; otherwise, determining that they are different.
  10. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The computer device according to claim 9, characterized in that the step of the voiceprint verification server determining whether the feature distance value between the i-vectors respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement comprises:
    mapping the voiceprint feature vectors corresponding to each frame of voice data to low-dimensional voiceprint identification vectors (i-vectors);
    calculating, by the cosine distance formula cos(x, y) = (x · y) / (|x| · |y|), the cosine distance value cos(x, y) between the i-vector corresponding to the first voiceprint feature and the i-vector corresponding to the pre-stored voiceprint feature, where x denotes the i-vector corresponding to the pre-stored voiceprint feature and y denotes the i-vector corresponding to the first voiceprint feature;
    determining whether the cosine distance value meets a preset condition;
    if so, determining that the feature distance value between the i-vectors respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement; otherwise, determining that it does not.
  11. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The computer device according to claim 10, characterized in that the step of mapping the voiceprint feature vectors corresponding to each frame of voice data to the low-dimensional i-vectors comprises:
    inputting the extracted voiceprint feature vectors corresponding to each frame of voice data into a GMM-UBM model, to obtain Gaussian supervectors characterizing the probability distribution of each frame of voice data over the Gaussian components;
    calculating, from each Gaussian supervector by the formula M = μ + Tω, the low-dimensional i-vector corresponding to each frame of voice data, where M is the Gaussian supervector of each frame of voice data, μ is the mean supervector of the GMM-UBM model, ω is the low-dimensional i-vector of each frame of voice data, and T is the transformation matrix that maps to the high-dimensional Gaussian space.
  12. The computer device according to claim 10, characterized in that the step of determining whether the cosine distance value meets the preset condition comprises:
    obtaining the first cosine distance values between the first voiceprint feature and each of the pre-stored voiceprint features in the pre-stored voiceprint feature data of multiple persons, wherein the voiceprint feature data of the multiple persons includes the pre-stored voiceprint feature of a target person;
    sorting the first cosine distance values in ascending order;
    determining whether the first cosine distance value corresponding to the pre-stored voiceprint feature of the target person is among the first preset number of sorted first cosine distance values;
    if so, determining that the first cosine distance value meets the preset condition; otherwise, determining that it does not.
  13. The computer device according to claim 10, characterized in that the step of determining whether the cosine distance value meets the preset condition comprises:
    obtaining a second cosine distance value between the pre-stored voiceprint feature of the target person and the first voiceprint feature;
    determining whether the second cosine distance value is less than or equal to a preset threshold;
    if so, determining that the second cosine distance value meets the preset condition; otherwise, determining that it does not.
  14. A computer non-volatile readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements a voiceprint verification method, the voiceprint verification method comprising:
    extracting, by a client server, a voice signal of an identity to be verified, and extracting MFCC-type voiceprint features corresponding to each frame of voice data in the voice signal;
    constructing, by the client server, the MFCC-type voiceprint features into voiceprint feature vectors corresponding to each frame of voice data, so as to form a first voiceprint feature;
    receiving, by a voiceprint verification server, the first voiceprint feature sent by the client server;
    determining, by the voiceprint verification server, whether a feature distance value between the voiceprint identification vectors (i-vectors) respectively corresponding to the first voiceprint feature and a pre-stored voiceprint feature meets a preset requirement;
    if so, determining that the first voiceprint feature is the same as the pre-stored voiceprint feature; otherwise, determining that they are different.
  15. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The computer non-volatile readable storage medium according to claim 14, characterized in that the step of the voiceprint verification server determining whether the feature distance value between the i-vectors respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement comprises:
    mapping the voiceprint feature vectors corresponding to each frame of voice data to low-dimensional voiceprint identification vectors (i-vectors);
    calculating, by the cosine distance formula cos(x, y) = (x · y) / (|x| · |y|), the cosine distance value cos(x, y) between the i-vector corresponding to the first voiceprint feature and the i-vector corresponding to the pre-stored voiceprint feature, where x denotes the i-vector corresponding to the pre-stored voiceprint feature and y denotes the i-vector corresponding to the first voiceprint feature;
    determining whether the cosine distance value meets a preset condition;
    if so, determining that the feature distance value between the i-vectors respectively corresponding to the first voiceprint feature and the pre-stored voiceprint feature meets the preset requirement; otherwise, determining that it does not.
  16. [Incorporated by reference (Rule 20.5) 01.02.2019]
    The computer non-volatile readable storage medium according to claim 15, characterized in that the step of mapping the voiceprint feature vectors corresponding to each frame of voice data to the low-dimensional i-vectors comprises:
    inputting the extracted voiceprint feature vectors corresponding to each frame of voice data into a GMM-UBM model, to obtain Gaussian supervectors characterizing the probability distribution of each frame of voice data over the Gaussian components;
    calculating, from each Gaussian supervector by the formula M = μ + Tω, the low-dimensional i-vector corresponding to each frame of voice data, where M is the Gaussian supervector of each frame of voice data, μ is the mean supervector of the GMM-UBM model, ω is the low-dimensional i-vector of each frame of voice data, and T is the transformation matrix that maps to the high-dimensional Gaussian space.
  17. The computer non-volatile readable storage medium according to claim 15, characterized in that the step of determining whether the cosine distance value meets the preset condition comprises:
    obtaining the first cosine distance values between the first voiceprint feature and each of the pre-stored voiceprint features in the pre-stored voiceprint feature data of multiple persons, wherein the voiceprint feature data of the multiple persons includes the pre-stored voiceprint feature of a target person;
    sorting the first cosine distance values in ascending order;
    determining whether the first cosine distance value corresponding to the pre-stored voiceprint feature of the target person is among the first preset number of sorted first cosine distance values;
    if so, determining that the first cosine distance value meets the preset condition; otherwise, determining that it does not.
  18. The computer non-volatile readable storage medium according to claim 15, characterized in that the step of determining whether the cosine distance value meets the preset condition comprises:
    obtaining a second cosine distance value between the pre-stored voiceprint feature of the target person and the first voiceprint feature;
    determining whether the second cosine distance value is less than or equal to a preset threshold;
    if so, determining that the second cosine distance value meets the preset condition; otherwise, determining that it does not.
PCT/CN2018/124402 2018-10-11 2019-02-01 Voiceprint verification method and apparatus, computer device and storage medium WO2020073519A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811184775.3A CN109257362A (en) 2018-10-11 2018-10-11 Method, apparatus, computer equipment and the storage medium of voice print verification
CN201811184775.3 2018-10-11

Publications (1)

Publication Number Publication Date
WO2020073519A1 true WO2020073519A1 (en) 2020-04-16

Family

ID=65046070

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124402 WO2020073519A1 (en) 2018-10-11 2019-02-01 Voiceprint verification method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN109257362A (en)
WO (1) WO2020073519A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109257362A (en) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice print verification
CN112687274A (en) * 2019-10-17 2021-04-20 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN111477251B (en) * 2020-05-21 2023-09-05 北京百度网讯科技有限公司 Model evaluation method and device and electronic equipment
CN111865926A (en) * 2020-06-24 2020-10-30 深圳壹账通智能科技有限公司 Call channel construction method and device based on double models and computer equipment
CN112509587B (en) * 2021-02-03 2021-04-30 南京大正智能科技有限公司 Method, device and equipment for dynamically matching mobile number and voiceprint and constructing index
CN112992152B (en) * 2021-04-22 2021-09-14 北京远鉴信息技术有限公司 Individual-soldier voiceprint recognition system and method, storage medium and electronic equipment
CN113366567A (en) * 2021-05-08 2021-09-07 腾讯音乐娱乐科技(深圳)有限公司 Voiceprint identification method, singer authentication method, electronic equipment and storage medium
CN114202891A (en) * 2021-12-28 2022-03-18 深圳市锐明技术股份有限公司 Method and device for sending alarm indication

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955677A (en) * 2014-05-12 2014-07-30 南京大学 Electrocardiogram recognizing method based on privacy protection
CN104680375A (en) * 2015-02-28 2015-06-03 优化科技(苏州)有限公司 Identification verifying system for living human body for electronic payment
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
US20180158464A1 (en) * 2013-07-17 2018-06-07 Verint Systems Ltd. Blind Diarization of Recorded Calls With Arbitrary Number of Speakers
CN109257362A (en) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice print verification

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN203102365U (en) * 2012-12-28 2013-07-31 国民技术股份有限公司 Terminal and authentication apparatus
CN106098068B (en) * 2016-06-12 2019-07-16 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition
CN106991312B (en) * 2017-04-05 2020-01-10 百融云创科技股份有限公司 Internet anti-fraud authentication method based on voiceprint recognition


Also Published As

Publication number Publication date
CN109257362A (en) 2019-01-22

Similar Documents

Publication Publication Date Title
WO2020073519A1 (en) Voiceprint verification method and apparatus, computer device and storage medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
WO2020073518A1 (en) Voiceprint verification method and apparatus, computer device, and storage medium
US10083693B2 (en) Method and system for using conversational biometrics and speaker identification/verification to filter voice streams
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
CN108053318B (en) Method and device for identifying abnormal transactions
US9099085B2 (en) Voice authentication systems and methods
US20180047397A1 (en) Voice print identification portal
KR20180034507A (en) METHOD, APPARATUS AND SYSTEM FOR BUILDING USER GLONASS MODEL
WO2020224114A1 (en) Residual delay network-based speaker confirmation method and apparatus, device and medium
KR20190022432A (en) ELECTRONIC DEVICE, IDENTIFICATION METHOD, SYSTEM, AND COMPUTER READABLE STORAGE MEDIUM
CN109346086A (en) Method for recognizing sound-groove, device, computer equipment and computer readable storage medium
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
CN109256138A (en) Auth method, terminal device and computer readable storage medium
WO2022126964A1 (en) Service data verification method and apparatus, device and storage medium
CN113886792A (en) Application method and system of print control instrument combining voiceprint recognition and face recognition
US11841932B2 (en) System and method for updating biometric evaluation systems
Poh et al. A biometric menagerie index for characterising template/model-specific variation
CN112201254A (en) Non-sensitive voice authentication method, device, equipment and storage medium
WO2021217979A1 (en) Voiceprint recognition method and apparatus, and device and storage medium
Zhang et al. Speech Perceptual Hashing Authentication Algorithm Based on Spectral Subtraction and Energy to Entropy Ratio.
WO2023078115A1 (en) Information verification method, and server and storage medium
EP4184355A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
TW202032536A (en) Speaker verification system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18936666

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18936666

Country of ref document: EP

Kind code of ref document: A1