CN110379433B - Identity authentication method and device, computer equipment and storage medium

Info

Publication number
CN110379433B
Authority
CN
China
Prior art keywords
sample
target
user
voice
voice frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910711306.0A
Other languages
Chinese (zh)
Other versions
CN110379433A
Inventor
Jia Liu (刘加)
Yi Liu (刘艺)
Liang He (何亮)
Weiqiang Zhang (张卫强)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co ltd
Tsinghua University
Original Assignee
Beijing Huacong Zhijia Technology Co ltd
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huacong Zhijia Technology Co., Ltd. and Tsinghua University
Priority to CN201910711306.0A
Publication of CN110379433A
Application granted
Publication of CN110379433B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 — Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 — User authentication
    • G06F21/32 — User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Abstract

The application relates to an identity authentication method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring voice data input by a target user according to a target dynamic verification code; dividing the voice data into at least one voice frame according to a preset segmentation algorithm; for each voice frame, extracting an acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm; inputting the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputting an intermediate user feature vector and a first posterior probability set corresponding to the voice frame; determining a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm; and performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame. With the method and the apparatus, the computational complexity of the server can be reduced and the processing efficiency of the server improved.

Description

Identity authentication method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of security technologies, and in particular, to a method and an apparatus for identity authentication, a computer device, and a storage medium.
Background
Two identity authentication techniques are currently in common use: authentication based on biometric identification (such as fingerprints, faces, or voice) and authentication based on dynamic verification codes. To further improve the security of identity authentication, voice-based identity authentication can be combined with a dynamic verification code to authenticate the user's identity.
In the traditional combined approach, the user terminal uses a voice collection device to collect the voice data that the user inputs according to the dynamic verification code, and sends the voice data to a server on which a voiceprint recognition model and a voice recognition model are deployed. After receiving the voice data, the server may input the voice data into the voiceprint recognition model and output a target user feature vector corresponding to the user. Meanwhile, the server may also input the voice data into the voice recognition model and output a target text corresponding to the voice data. If the target user feature vector is similar to a pre-stored user feature vector of the user and the target text is the same as the dynamic verification code, the server may determine that the user is a legitimate user. Otherwise, the server may determine that the user is an illegitimate user.
In this traditional approach, the server must deploy two separate models, a voiceprint recognition model and a voice recognition model, with different structures and parameters. Each model has to process the user's voice data independently to obtain the user feature vector and the target text respectively, so the computational complexity of the server is high and its processing efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an identity authentication method, apparatus, computer device and storage medium.
In a first aspect, a method of identity verification is provided, the method comprising:
acquiring voice data input by a target user according to a target dynamic verification code;
dividing the voice data into at least one voice frame according to a preset segmentation algorithm;
for each voice frame, extracting an acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm;
inputting the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputting an intermediate user feature vector corresponding to the voice frame and a first posterior probability set, wherein the first posterior probability set comprises posterior probabilities corresponding to all preset pronunciation units;
determining a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm;
and performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
As an optional implementation manner, the identity verification multitask model comprises a multitask shared hidden layer, a voiceprint recognition network and a voice recognition network;
the method for inputting the acoustic feature vector corresponding to the voice frame into the pre-trained identity verification multitask model and outputting the intermediate user feature vector corresponding to the voice frame and the first posterior probability set comprises the following steps:
inputting the acoustic feature vector corresponding to the voice frame into the multitask shared hidden layer, and outputting an intermediate feature vector corresponding to the voice frame;
inputting the intermediate feature vector corresponding to the voice frame into the voice recognition network, and outputting a pronunciation feature vector corresponding to the voice frame and a first posterior probability set;
and inputting the intermediate feature vector and the pronunciation feature vector corresponding to the voice frame into the voiceprint recognition network, and outputting the intermediate user feature vector corresponding to the voice frame.
As an optional implementation manner, the performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame includes:
determining a target dynamic verification code score corresponding to the target user according to the first posterior probability set corresponding to each voice frame;
if the similarity between the first user feature vector and a pre-stored second user feature vector corresponding to the target user is greater than or equal to a preset similarity threshold, and the target dynamic verification code score is greater than or equal to a preset dynamic verification code score threshold, determining that the target user is a legitimate user;
and if the similarity between the first user feature vector and the second user feature vector is smaller than the preset similarity threshold, or the target dynamic verification code score is smaller than the preset dynamic verification code score threshold, determining that the target user is an illegitimate user.
As an optional implementation manner, the determining, according to the first posterior probability set corresponding to each speech frame, a target dynamic verification code score corresponding to the target user includes:
acquiring a pronunciation unit sequence corresponding to the target dynamic verification code;
determining a target pronunciation unit corresponding to each voice frame according to the first posterior probability set corresponding to each voice frame, the pronunciation unit sequence and a preset forced alignment algorithm;
for each voice frame, determining the posterior probability of the target pronunciation unit corresponding to the voice frame in the first posterior probability set corresponding to the voice frame, and determining the product of the posterior probability of the target pronunciation unit and the pre-stored prior probability of the target pronunciation unit as the likelihood value of the target pronunciation unit;
and determining the target dynamic verification code score corresponding to the target user according to the likelihood value of the target pronunciation unit corresponding to each voice frame.
As an optional implementation manner, the obtaining a pronunciation unit sequence corresponding to the target dynamic verification code includes:
determining a word set corresponding to the target dynamic verification code according to the target dynamic verification code and a preset word segmentation algorithm;
for each word in the word set, determining the pronunciation unit sequence corresponding to the word according to a pre-stored correspondence between words and pronunciation unit sequences;
and sequencing the pronunciation unit sequence corresponding to each word according to the sequence of each word in the target dynamic verification code to obtain the pronunciation unit sequence corresponding to the target dynamic verification code.
As an optional implementation manner, the determining, according to the likelihood value of the target pronunciation unit corresponding to each speech frame, a target dynamic verification code score corresponding to the target user includes:
determining the difference value between the likelihood value of the target pronunciation unit corresponding to each voice frame and the maximum likelihood value in the likelihood values of the preset pronunciation units corresponding to the voice frame as the dynamic verification code score corresponding to the voice frame;
and determining the average value of the dynamic verification code scores corresponding to the voice frames as the target dynamic verification code score corresponding to the target user.
As an optional implementation, the method further comprises:
acquiring a pre-stored first training sample set, wherein the first training sample set comprises a plurality of sample user identifications and first sample voice data corresponding to each sample user identification;
for each first sample voice data in the first training sample set, dividing the first sample voice data into at least one first sample voice frame according to a preset segmentation algorithm;
for each first sample speech frame corresponding to the first sample speech data, extracting an acoustic feature vector corresponding to the first sample speech frame according to a preset acoustic feature extraction algorithm;
inputting the acoustic feature vector of each first sample voice frame corresponding to the first sample voice data into an identity verification multitask model to be trained, and outputting a second posterior probability set corresponding to the first sample voice data, wherein the second posterior probability set comprises posterior probabilities corresponding to user identifiers of the samples;
determining a first cost function corresponding to the first training sample set according to the posterior probability of the sample user identification corresponding to each first sample voice data;
and updating the parameters corresponding to the multitask sharing hidden layer, the parameters corresponding to the voiceprint recognition network and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the first cost function and a preset first parameter updating algorithm.
As an optional implementation, the method further comprises:
acquiring a pre-stored second training sample set, wherein the second training sample set comprises a plurality of second sample voice frames and a sample pronunciation unit corresponding to each second sample voice frame;
extracting an acoustic feature vector corresponding to each second sample speech frame in the second training sample set according to the preset acoustic feature extraction algorithm;
inputting the acoustic feature vector corresponding to the second sample voice frame into an identity verification multitask model to be trained, and outputting a third posterior probability set corresponding to the second sample voice frame, wherein the third posterior probability set comprises posterior probabilities corresponding to all sample pronunciation units;
determining a second cost function corresponding to the second training sample set according to the posterior probability of the sample pronunciation unit corresponding to each second sample voice frame;
and updating the parameters corresponding to the multitask sharing hidden layer and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the second cost function and a preset second parameter updating algorithm.
As an optional implementation, the method further comprises:
obtaining a plurality of pre-stored verification sample sets, wherein each verification sample set comprises a plurality of sample user identifications and second sample voice data corresponding to each sample user identification;
for each second sample voice data in each verification sample set, dividing the second sample voice data into at least one third sample voice frame according to a preset segmentation algorithm;
extracting an acoustic feature vector corresponding to each third sample speech frame corresponding to the second sample speech data according to a preset acoustic feature extraction algorithm;
inputting the acoustic feature vector of each third sample voice frame corresponding to the second sample voice data into an identity verification multitask model to be verified, and outputting a fourth posterior probability set corresponding to the second sample voice data, wherein the fourth posterior probability set comprises posterior probabilities corresponding to user identifiers of the samples;
if the posterior probability of the sample user identifier corresponding to the second sample voice data is the maximum value in the fourth posterior probability set corresponding to the second sample voice data, determining the second sample voice data as the target sample voice data;
determining the ratio of the number of target sample voice data in the verification sample set to the total number of second sample voice data in the verification sample set as the accuracy of the verification sample set;
and determining the change rate of the accuracy rate corresponding to each verification sample set according to the accuracy rate of each verification sample set, and if the change rate of the accuracy rate corresponding to the continuous preset number of verification sample sets is less than or equal to a preset change rate threshold value, determining that the identity verification multitask model to be verified is trained.
In a second aspect, there is provided an apparatus for identity verification, the apparatus comprising:
the first acquisition module is used for acquiring voice data input by a target user according to the target dynamic verification code;
the first division module is used for dividing the voice data into at least one voice frame according to a preset segmentation algorithm;
the first extraction module is used for extracting an acoustic feature vector corresponding to each voice frame according to a preset acoustic feature extraction algorithm;
the first output module is used for inputting the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model and outputting an intermediate user feature vector corresponding to the voice frame and a first posterior probability set, wherein the first posterior probability set comprises posterior probabilities corresponding to all preset pronunciation units;
the first determining module is used for determining a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm;
and the verification module is used for performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
In a third aspect, a computer device is provided, which includes a memory and a processor, the memory stores a computer program operable on the processor, and the processor executes the computer program to implement the following steps:
acquiring voice data input by a target user according to a target dynamic verification code;
dividing the voice data into at least one voice frame according to a preset segmentation algorithm;
for each voice frame, extracting an acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm;
inputting the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputting an intermediate user feature vector corresponding to the voice frame and a first posterior probability set, wherein the first posterior probability set comprises posterior probabilities corresponding to all preset pronunciation units;
determining a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm;
and performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring voice data input by a target user according to a target dynamic verification code;
dividing the voice data into at least one voice frame according to a preset segmentation algorithm;
for each voice frame, extracting an acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm;
inputting the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputting an intermediate user feature vector corresponding to the voice frame and a first posterior probability set, wherein the first posterior probability set comprises posterior probabilities corresponding to all preset pronunciation units;
determining a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm;
and performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
The embodiment of the application provides an identity authentication method and apparatus, a computer device, and a storage medium. The server acquires voice data input by a target user according to the target dynamic verification code, and divides the voice data into at least one voice frame according to a preset segmentation algorithm. Then, for each voice frame, the server extracts the acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm. Next, the server inputs the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputs the intermediate user feature vector corresponding to the voice frame and a first posterior probability set. The first posterior probability set comprises the posterior probabilities corresponding to the preset pronunciation units. Finally, the server determines a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm, and performs identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame. In this way, the server does not need to deploy two sets of models (a voiceprint recognition model and a voice recognition model with different structures and parameters) to process the user's voice data; a single identity verification multitask model suffices, which reduces the computational complexity of the server and improves its processing efficiency.
Drawings
Fig. 1 is an architecture diagram of an identity authentication system according to an embodiment of the present application;
Fig. 2 is a flowchart of an identity verification method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an identity verification multitask model according to an embodiment of the present application;
Fig. 4 is a flowchart of an identity verification method according to an embodiment of the present application;
Fig. 5 is a flowchart of a method for determining a target dynamic verification code score according to an embodiment of the present application;
Fig. 6 is a flowchart of an account registration method according to an embodiment of the present application;
Fig. 7 is a flowchart of a training method for an identity verification multitask model according to an embodiment of the present application;
Fig. 8 is a flowchart of a verification method for an identity verification multitask model according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an identity authentication apparatus according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The embodiment of the application provides an identity authentication method which can be applied to an identity authentication system. Fig. 1 is an architecture diagram of an authentication system according to an embodiment of the present application. As shown in fig. 1, the authentication system includes a user terminal and a server connected through a communication network. The communication network may be a wired network, a wireless network, or other types of communication networks, and the embodiments of the present application are not limited; the user terminal may be a portable electronic device with recording and computing functions, such as a mobile phone, a tablet computer, a notebook computer, and the like.
The user terminal is used for receiving a user's account registration request or identity verification request, generating a dynamic verification code corresponding to the user, collecting the voice data that the user inputs according to the dynamic verification code, and sending the user's account, the dynamic verification code, and the voice data to the server. The server is used for training and verifying the identity verification multitask model to be trained. During account registration, the server is further configured to divide the voice data into at least one voice frame according to a preset segmentation algorithm, input the acoustic feature vector corresponding to each voice frame into the identity verification multitask model, output the intermediate user feature vector corresponding to each voice frame, obtain the target user feature vector corresponding to the user according to the intermediate user feature vectors and a preset pooling algorithm, and store the user's account together with the target user feature vector. During identity verification, the server is further configured to divide the voice data into at least one voice frame according to the preset segmentation algorithm, input the acoustic feature vector corresponding to each voice frame into the identity verification multitask model, output the intermediate user feature vector and the first posterior probability set corresponding to each voice frame, obtain the target user feature vector corresponding to the user according to the intermediate user feature vectors and the preset pooling algorithm, obtain the target dynamic verification code score according to the first posterior probability sets, and verify the user's identity according to the target user feature vector and the target dynamic verification code score.
As an optional implementation manner, the server may further send the authentication multitasking model to the user terminal, and the user terminal processes the voice data input by the user based on the authentication multitasking model. The embodiment of the application is described by taking the example that the server processes the voice data input by the user based on the identity authentication multitask model, and other situations are similar to the above.
The following describes a method for user authentication provided in an embodiment of the present application in detail with reference to specific embodiments. As shown in fig. 2, the specific steps are as follows:
step 201, acquiring voice data input by a target user according to a target dynamic verification code.
In implementation, when a user (i.e., a target user) logs in a user terminal by using a target account corresponding to the target user, the user terminal may generate a target dynamic verification code corresponding to the target account. The user terminal can randomly select a candidate dynamic verification code from a pre-stored candidate dynamic verification code set as a target dynamic verification code corresponding to the target account; the user terminal can also randomly select a preset number of candidate words from a pre-stored candidate word set, and randomly combine the selected preset number of candidate words to serve as a target dynamic verification code corresponding to the target account; the user terminal may also generate a target dynamic verification code corresponding to the target account in other manners, which is not limited in this embodiment of the application. After the user terminal generates the target dynamic verification code corresponding to the target account, the target dynamic verification code can be displayed in a display interface. As an optional implementation manner, the user terminal may further display a prompt message for prompting the target user to read the target dynamic verification code in the display interface. The user terminal may then activate a voice capture device (e.g., a microphone) to capture voice data of the target user reading the target dynamic verification code. The user terminal can send an identity authentication request to the server after obtaining the voice data input by the target user according to the target dynamic authentication code. The identity authentication request carries a target account number, a target dynamic authentication code and voice data. After receiving the authentication request, the server can analyze the authentication request to obtain a target account number, a target dynamic verification code and voice data carried in the authentication request.
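As an illustration of the second generation strategy described above, the following Python sketch randomly combines a preset number of candidate words into a target dynamic verification code. The candidate word list and the count of three words are assumptions, not values fixed by the application.

```python
import random

# Illustrative candidate word set; the application does not fix its contents.
CANDIDATE_WORDS = ["Beijing", "Qinghua", "University", "Hangtian"]

def make_dynamic_code(num_words=3):
    # Randomly select a preset number of candidate words and combine them
    # into a target dynamic verification code.
    return " ".join(random.sample(CANDIDATE_WORDS, num_words))
```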
Step 202, dividing the voice data into at least one voice frame according to a preset segmentation algorithm.
The server may store a preset segmentation algorithm in advance. The segmentation algorithm may be a frame windowing algorithm, or may be other types of segmentation algorithms, and the embodiment of the present application is not limited. After the server obtains the voice data, the voice data can be divided into at least one voice frame according to a pre-stored segmentation algorithm. Wherein the voice frame is a short-time voice segment with a duration of about 25 milliseconds.
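A minimal sketch of one possible frame-windowing segmentation is shown below. The 25 ms frame length matches the description above, while the 10 ms frame shift, the 16 kHz sampling rate, and the Hamming window are common assumptions that the application does not fix.

```python
import numpy as np

def split_frames(signal, sr=16000, frame_ms=25, shift_ms=10):
    # Divide a 1-D speech signal into overlapping short-time voice frames
    # and apply a Hamming window to each frame.
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, shift)]
    return np.stack(frames)  # shape: (num_frames, frame_len)
```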
Step 203, extracting an acoustic feature vector corresponding to each speech frame according to a preset acoustic feature extraction algorithm.
In implementation, the server may store the acoustic feature extraction algorithm in advance. For each voice frame, the server can extract the acoustic feature vector corresponding to the voice frame according to the preset acoustic feature extraction algorithm. The acoustic feature vector may, for example, be a vector of mel-frequency cepstral coefficients (MFCCs).
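The sketch below extracts MFCC vectors with librosa; the 13 coefficients and the window/hop sizes are assumptions. Note that librosa performs its own framing internally, so in practice this step can subsume the manual segmentation sketched above.

```python
import librosa

# Load the utterance and extract one MFCC acoustic feature vector per frame.
y, sr = librosa.load("utterance.wav", sr=16000)  # "utterance.wav" is illustrative
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 25 ms analysis window
                            hop_length=int(0.010 * sr))  # 10 ms frame shift
# mfcc.shape == (13, num_frames): column t is the acoustic feature vector
# corresponding to voice frame t.
```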
Step 204, inputting the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputting the intermediate user feature vector corresponding to the voice frame and a first posterior probability set.
The first posterior probability set comprises posterior probabilities corresponding to the preset pronunciation units.
In implementation, the server may store a pre-trained authentication multitask model in advance. For each voice frame in the voice frame set, after obtaining the acoustic feature vector corresponding to the voice frame, the server may input the acoustic feature vector corresponding to the voice frame to the identity authentication multitask model. The identity authentication multitask model can output the intermediate user characteristic vector and the first posterior probability set corresponding to the voice frame. The first posterior probability set comprises posterior probabilities corresponding to preset pronunciation units in the identity verification multitask model.
Fig. 3 is a schematic structural diagram of the identity verification multitask model according to an embodiment of the present application. As shown in Fig. 3, the identity verification multitask model comprises a multitask shared hidden layer, a voiceprint recognition network, and a voice recognition network. For each voice frame in the voice frame set, the server inputs the acoustic feature vector corresponding to the voice frame into the pre-trained identity verification multitask model and obtains the intermediate user feature vector corresponding to the voice frame and the first posterior probability set as follows:
Step one, inputting the acoustic feature vector corresponding to the voice frame into the multitask shared hidden layer, and outputting an intermediate feature vector corresponding to the voice frame.
In implementation, for each voice frame in the voice frame set, the server may input the acoustic feature vector corresponding to the voice frame into the multitask shared hidden layer of the identity verification multitask model. The multitask shared hidden layer can output the intermediate feature vector corresponding to the voice frame.
Step two, inputting the intermediate feature vector corresponding to the voice frame into the voice recognition network, and outputting the pronunciation feature vector corresponding to the voice frame and the first posterior probability set.
In implementation, after obtaining the intermediate feature vector corresponding to the voice frame, the server may input it into the voice recognition network. The voice recognition network can output the pronunciation feature vector and the first posterior probability set corresponding to the voice frame.
Step three, inputting the intermediate feature vector and the pronunciation feature vector corresponding to the voice frame into the voiceprint recognition network, and outputting the intermediate user feature vector corresponding to the voice frame.
In implementation, after obtaining the intermediate feature vector and the pronunciation feature vector corresponding to the voice frame, the server may input them into the voiceprint recognition network. The voiceprint recognition network can output the intermediate user feature vector corresponding to the voice frame.
In the embodiment of the application, unlike a traditional voiceprint recognition model, the voiceprint recognition network receives both the intermediate feature vector output by the multitask shared hidden layer and the pronunciation feature vector output by the voice recognition network for each voice frame, which improves the accuracy of the intermediate user feature vector that it outputs.
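A minimal PyTorch sketch of the topology in Fig. 3 follows. Only the wiring is taken from the application (a shared hidden layer; a voice recognition network producing pronunciation feature vectors and first posterior sets; a voiceprint recognition network fed with the concatenated intermediate and pronunciation feature vectors); all layer types, widths, and output counts are assumptions.

```python
import torch
import torch.nn as nn

class AuthMultitaskModel(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, num_units=100, num_speakers=1000):
        super().__init__()
        # Multitask shared hidden layer.
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        # Voice recognition network: pronunciation feature vector plus
        # posteriors over the preset pronunciation units.
        self.asr_hidden = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.asr_out = nn.Linear(hidden, num_units)
        # Voiceprint recognition network: takes intermediate + pronunciation
        # feature vectors; the speaker output layer is used only in training.
        self.spk_hidden = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.spk_out = nn.Linear(hidden, num_speakers)

    def forward(self, x):  # x: (T, feat_dim), one acoustic feature vector per frame
        inter = self.shared(x)                       # intermediate feature vectors
        pron = self.asr_hidden(inter)                # pronunciation feature vectors
        first_post = self.asr_out(pron).softmax(-1)  # first posterior probability sets
        inter_user = self.spk_hidden(torch.cat([inter, pron], dim=-1))
        return inter_user, first_post                # per-frame outputs
```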
Step 205, determining a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm.
In an implementation, the pooling algorithm may be stored in the server in advance. After obtaining the intermediate user feature vectors corresponding to the voice frames, the server can pool them according to the preset pooling algorithm to obtain the first user feature vector corresponding to the target user.
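Below is a sketch of one common choice of pooling algorithm (statistics pooling: per-dimension mean and standard deviation over frames). The application only requires some preset pooling algorithm, so simple averaging would fit equally well.

```python
import torch

def pool_user_vector(inter_user):
    # inter_user: (T, D) intermediate user feature vectors, one row per voice frame.
    mean = inter_user.mean(dim=0)
    std = inter_user.std(dim=0)
    return torch.cat([mean, std])  # (2D,) first user feature vector for the utterance
```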
Step 206, performing identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
In implementation, after the server obtains the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame, the server may perform identity authentication on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame. Therefore, the server can process the voice data of the user without deploying two sets of voiceprint recognition models and voice recognition models with different structures and parameters and only deploying one set of identity verification multitask model, so that the calculation complexity of the server is reduced, and the processing efficiency of the server is improved.
As shown in fig. 4, the server performs an identity verification process on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame as follows:
step 401, determining a target dynamic verification code score corresponding to a target user according to the first posterior probability set corresponding to each voice frame.
In implementation, after the server obtains the first posterior probability set corresponding to each voice frame, the server may further determine the target dynamic verification code score corresponding to the target user according to the first posterior probability sets. After obtaining the first user feature vector and the target dynamic verification code score corresponding to the target user, the server may further determine whether the first user feature vector is similar to a pre-stored second user feature vector corresponding to the target user (i.e., to the target account carried in the identity verification request), and whether the target dynamic verification code score is greater than or equal to a preset dynamic verification code score threshold. If the similarity between the first user feature vector and the pre-stored second user feature vector corresponding to the target user is greater than or equal to the preset similarity threshold, and the target dynamic verification code score is greater than or equal to the preset dynamic verification code score threshold, step 402 is executed. If the similarity between the first user feature vector and the second user feature vector is smaller than the preset similarity threshold, or the target dynamic verification code score is smaller than the preset dynamic verification code score threshold, step 403 is executed. The similarity may be a Euclidean similarity, a cosine similarity, or another type of similarity, which is not limited in the embodiment of the application.
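A sketch of this decision rule, assuming cosine similarity; both thresholds are preset values that the application does not fix, so the numbers here are purely illustrative.

```python
import torch
import torch.nn.functional as F

def verify(first_vec, second_vec, code_score,
           sim_threshold=0.7, score_threshold=-1.0):
    # Compare the utterance-level user feature vector against the enrolled one
    # and check the target dynamic verification code score against its threshold.
    sim = F.cosine_similarity(first_vec, second_vec, dim=0)
    return bool(sim >= sim_threshold) and code_score >= score_threshold
```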
Fig. 5 is a flowchart of a method for determining a target dynamic verification code score according to an embodiment of the present application. As shown in Fig. 5, the server determines the target dynamic verification code score corresponding to the target user according to the first posterior probability set corresponding to each speech frame as follows:
step 501, obtaining a pronunciation unit sequence corresponding to a target dynamic verification code.
In implementation, after receiving the authentication request, the server may parse the authentication request to obtain a target dynamic verification code carried in the authentication request. Then, the server can further acquire the pronunciation unit sequence corresponding to the target dynamic verification code. The processing process of the server for acquiring the pronunciation unit sequence corresponding to the target dynamic verification code is as follows:
step one, determining a word set corresponding to a target dynamic verification code according to the target dynamic verification code and a preset word segmentation algorithm.
In implementation, the server may store the word segmentation algorithm in advance. After the server obtains the target dynamic verification code, word segmentation processing can be performed on the target dynamic verification code according to a preset word segmentation algorithm, so that a word set corresponding to the target dynamic verification code is obtained. For example, the target dynamic verification code is "Qinghua university", and the server performs word segmentation processing on the target dynamic verification code to obtain a word set { "Qinghua", "university" } corresponding to the target dynamic verification code.
And step two, aiming at each word in the word set, determining a pronunciation unit sequence corresponding to the word according to the corresponding relation between the pre-stored word and the pronunciation unit sequence.
In an implementation, the server may store the correspondence between words and pronunciation unit sequences (also referred to as a pronunciation dictionary) in advance. Table 1 shows an example of the correspondence between words and pronunciation unit sequences stored in the server in advance.
Table 1

Serial number | Word | Pronunciation unit sequence
1 | Beijing | b ei3 j ing1
2 | Qinghua | q ing1 h ua2
3 | University | d a4 x ve2
4 | Hangtian (space flight) | h ang2 t ian1
After the server obtains the word set corresponding to the target dynamic verification code, for each word in the word set, the server may determine the pronunciation unit sequence corresponding to the word according to the pre-stored correspondence between the word and the pronunciation unit sequence.
And thirdly, sequencing the pronunciation unit sequence corresponding to each word according to the sequence of each word in the target dynamic verification code to obtain the pronunciation unit sequence corresponding to the target dynamic verification code.
In implementation, after obtaining the pronunciation unit sequence corresponding to each word, the server may sort the pronunciation unit sequences corresponding to each word according to the order of each word in the target dynamic verification code, so as to obtain the pronunciation unit sequence corresponding to the target dynamic verification code. For example, the target dynamic verification code is "Qinghua university", and the pronunciation unit sequence corresponding to the target dynamic verification code is "q ing1 h ua2 d a4 x ve 2".
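The following sketch walks through steps one to three with a toy pronunciation dictionary. The dictionary entries mirror Table 1, and the input is assumed to be already segmented into words (any preset word segmentation tool could produce it).

```python
# Toy pronunciation dictionary mirroring Table 1 (illustrative only).
PRONUNCIATION_DICT = {
    "Qinghua": ["q", "ing1", "h", "ua2"],
    "University": ["d", "a4", "x", "ve2"],
}

def code_to_unit_sequence(code_words):
    # code_words: the target dynamic verification code already split into
    # ordered words, e.g. ["Qinghua", "University"].
    units = []
    for word in code_words:
        units.extend(PRONUNCIATION_DICT[word])  # step two: dictionary lookup
    return units  # step three: concatenated in word order

print(code_to_unit_sequence(["Qinghua", "University"]))
# ['q', 'ing1', 'h', 'ua2', 'd', 'a4', 'x', 've2']
```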
Step 502, determining a target pronunciation unit corresponding to each speech frame according to the first posterior probability set corresponding to each speech frame, the pronunciation unit sequence, and a preset forced alignment algorithm.
In implementation, the server may store a forced alignment algorithm in advance. The forced alignment algorithm may be the Viterbi algorithm or another type of forced alignment algorithm, which is not limited in the embodiment of the application. After the server obtains the first posterior probability set corresponding to each voice frame and the pronunciation unit sequence corresponding to the target dynamic verification code, it can forcibly align the voice frames with the pronunciation unit sequence, i.e., determine the start and end frames of each pronunciation unit in the sequence, thereby obtaining the target pronunciation unit corresponding to each voice frame.
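A minimal Viterbi forced-alignment sketch under simplifying assumptions: a strictly left-to-right pass over the pronunciation unit sequence (each frame either stays on the current unit or advances to the next one), with log posteriors as emission scores and no transition priors.

```python
import numpy as np

def force_align(log_post, unit_seq, unit_index):
    # log_post: (T, U) log posterior probabilities over all preset units.
    # unit_seq: pronunciation units of the verification code, in order.
    # unit_index: dict mapping a unit label to its column in log_post.
    T, S = log_post.shape[0], len(unit_seq)
    emit = np.array([[log_post[t, unit_index[u]] for u in unit_seq]
                     for t in range(T)])
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = emit[0, 0]          # alignment must start at the first unit
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            score[t, s] = max(stay, move) + emit[t, s]
            back[t, s] = s if stay >= move else s - 1
    path = [S - 1]                    # alignment must end at the last unit
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [unit_seq[s] for s in reversed(path)]  # target unit per frame
```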
Step 503, for each speech frame, determining a posterior probability of a target pronunciation unit corresponding to the speech frame in the first posterior probability set corresponding to the speech frame, and determining a product of the posterior probability of the target pronunciation unit and a pre-stored prior probability of the target pronunciation unit as a likelihood value of the target pronunciation unit.
In implementation, the server may store the prior probability of each preset pronunciation unit in advance. After the server obtains the target pronunciation unit corresponding to each voice frame, for each voice frame, the server may determine the posterior probability of the target pronunciation unit corresponding to the voice frame in the first posterior probability set corresponding to the voice frame. The server may then determine a prior probability for the target pronunciation unit among the pre-stored prior probabilities for each of the pre-set pronunciation units. Then, the server can calculate the product of the posterior probability of the target pronunciation unit and the prior probability of the target pronunciation unit to obtain the likelihood value of the target pronunciation unit.
And step 504, determining a target dynamic verification code score corresponding to a target user according to the likelihood value of the target pronunciation unit corresponding to each voice frame.
In implementation, after obtaining the likelihood value of the target pronunciation unit corresponding to each voice frame, the server may determine the target dynamic verification code score corresponding to the target user according to the likelihood value of the target pronunciation unit corresponding to each voice frame. The server determines the target dynamic verification code score corresponding to the target user according to the likelihood value of the target pronunciation unit corresponding to each voice frame, and the processing process of the server is as follows:
step one, aiming at each voice frame, determining the difference value of the likelihood value of the target pronunciation unit corresponding to the voice frame and the maximum likelihood value in the likelihood values of the preset pronunciation units corresponding to the voice frame as the dynamic verification code score corresponding to the voice frame.
In implementation, after the server obtains the likelihood values of the target pronunciation unit corresponding to the speech frame, the server may determine the maximum likelihood value from the likelihood values of each preset pronunciation unit corresponding to the speech frame. Then, for each voice frame, the server may calculate a difference between a likelihood value of a target pronunciation unit corresponding to the voice frame and the maximum likelihood value, to obtain a dynamic verification code score corresponding to the voice frame.
And step two, determining the average value of the dynamic verification code scores corresponding to the voice frames as the target dynamic verification code score corresponding to the target user.
In implementation, after the server obtains the dynamic verification code scores corresponding to the voice frames, an average value of the dynamic verification code scores corresponding to the voice frames can be calculated and used as a target dynamic verification code score corresponding to a target user.
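A sketch of steps one and two above. The per-frame likelihood follows the application's definition (posterior multiplied by the pre-stored prior); the array names and shapes are assumptions.

```python
import numpy as np

def code_score(posteriors, priors, target_idx):
    # posteriors: (T, U) first posterior probability sets, one row per frame.
    # priors: (U,) pre-stored prior probabilities of the preset pronunciation units.
    # target_idx: (T,) column index of each frame's target pronunciation unit
    #             from the forced alignment.
    likelihoods = posteriors * priors                # (T, U) likelihood values
    target = likelihoods[np.arange(len(target_idx)), target_idx]
    frame_scores = target - likelihoods.max(axis=1)  # step one: <= 0 per frame
    return frame_scores.mean()                       # step two: average score
```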
The traditional speech recognition model outputs text content corresponding to speech data in a decoding mode, and the decoding mode needs to consider the interrelationship between acoustic feature vectors corresponding to the speech data and pronunciation units, between pronunciation units and words and between words and words, and needs to perform decoding operation in a larger candidate word list to obtain the most probable text content corresponding to the speech data, so that the traditional speech recognition model has higher computational complexity. In the embodiment of the application, after the server obtains the posterior probability of each preset pronunciation unit corresponding to each voice frame, the server determines the target pronunciation unit corresponding to each voice frame according to the pronunciation unit sequence corresponding to the dynamic verification code and the forced alignment algorithm, and determines the likelihood value of the target pronunciation unit corresponding to each voice frame. Then, the server can determine the target dynamic verification code score according to the likelihood value of the target pronunciation unit corresponding to each voice frame, thereby reducing the calculation complexity.
Step 402, determining that the target user is a legitimate user.
In implementation, if the similarity between the first user feature vector and the pre-stored second user feature vector corresponding to the target user is greater than or equal to the preset similarity threshold, and the target dynamic verification code score is greater than or equal to the preset dynamic verification code score threshold, the server may determine that the target user is a legitimate user. The server may then send an identity verification success response to the user terminal. After receiving the identity verification success response, the user terminal allows the target user to log in with the target account.
Step 403, determining that the target user is an illegitimate user.
In implementation, if the similarity between the first user feature vector and the second user feature vector is smaller than the preset similarity threshold, or the target dynamic verification code score is smaller than the preset dynamic verification code score threshold, the server determines that the target user is an illegitimate user. The server may then send an identity verification failure response to the user terminal. After receiving the identity verification failure response, the user terminal refuses to let the target user log in with the target account.
The embodiment of the application provides an identity authentication method. The server acquires voice data input by a target user according to the target dynamic verification code, and divides the voice data into at least one voice frame according to a preset segmentation algorithm. Then, for each voice frame, the server extracts the acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm. Next, the server inputs the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, and outputs the intermediate user feature vector corresponding to the voice frame and a first posterior probability set. The first posterior probability set comprises the posterior probabilities corresponding to the preset pronunciation units. Finally, the server determines a first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm, and performs identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame. In this way, the server can process the user's voice data with a single identity verification multitask model instead of two separately deployed models (a voiceprint recognition model and a voice recognition model with different structures and parameters), which reduces the computational complexity of the server and improves its processing efficiency.
The embodiment of the present application further provides an account registration method, as shown in fig. 6, the specific processing procedure is as follows:
step 601, acquiring voice data input by a target user according to a target dynamic verification code.
In implementation, when a user (i.e., the target user) creates a target account for logging in to the user terminal, the target user may enter the target account in the account input box of the user terminal's account registration interface. The user terminal may then generate a target dynamic verification code corresponding to the target account. This processing procedure is similar to that by which the user terminal generates the target dynamic verification code in step 201, and is not repeated here. After generating the target dynamic verification code corresponding to the target account, the user terminal can display it in the display interface. As an optional implementation manner, the user terminal may further display a prompt message in the display interface prompting the target user to read the target dynamic verification code aloud. The user terminal may then activate a voice collection device (e.g., a microphone) to collect the voice data of the target user reading the target dynamic verification code. After obtaining the voice data input by the target user according to the target dynamic verification code, the user terminal can send an account registration request to the server. The account registration request carries the target account, the target dynamic verification code, and the voice data. After receiving the account registration request, the server can parse it to obtain the target account, the target dynamic verification code, and the voice data carried therein.
Step 602, dividing the voice data into at least one voice frame according to a preset segmentation algorithm.
Step 603, for each voice frame, extracting an acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm.
Step 604, inputting the acoustic feature vector corresponding to the voice frame into the pre-trained identity verification multitask model, and outputting the intermediate user feature vector corresponding to the voice frame.
Step 605, determining a target user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm.
In the implementation, the processing procedure from step 602 to step 605 is similar to the processing procedure of the server determining the first user feature vector corresponding to the target user in the above-mentioned steps of the identity verification method, and is not described herein again.
Step 606, storing the corresponding relation between the target account and the target user feature vector.
In implementation, after obtaining the target user feature vector corresponding to the target user, the server may store the correspondence between the target account and the target user feature vector locally, for use in subsequent identity authentication of the target user.
The embodiment of the present application further provides a training method for an identity verification multitask model, as shown in fig. 7, the specific processing procedure is as follows:
step 701, initializing an identity verification multitask model to be trained.
In implementation, the server may store the authentication multitask model to be trained in advance. When the server needs to train the authentication multitask model to be trained, the server can randomly initialize parameters in the authentication multitask model to be trained.
Step 702A, a first set of pre-stored training samples is obtained.
The first training sample set comprises a plurality of sample user identifications and first sample voice data corresponding to each sample user identification.
Step 702B, a second set of pre-stored training samples is obtained.
The second training sample set comprises a plurality of second sample speech frames and a sample pronunciation unit corresponding to each second sample speech frame.
In an implementation, the server may store a plurality of first training sample sets and a plurality of second training sample sets in advance. Each first training sample set comprises a plurality of sample user identifications and the first sample voice data corresponding to each sample user identification; each second training sample set comprises a plurality of second sample speech frames and the sample pronunciation unit corresponding to each second sample speech frame. After initializing the parameters of the identity verification multitask model to be trained, the server can obtain a first preset number of pre-stored first training sample sets and a second preset number of pre-stored second training sample sets.
Step 703, for each first sample speech data in the first training sample set, dividing the first sample speech data into at least one first sample speech frame according to a preset segmentation algorithm.
In implementation, the processing procedure of step 703 is similar to the processing procedure of step 202, and is not described herein again.
Step 704A, for each first sample speech frame corresponding to the first sample speech data, extracting an acoustic feature vector corresponding to the first sample speech frame according to a preset acoustic feature extraction algorithm.
Step 704B, extracting, according to a preset acoustic feature extraction algorithm, an acoustic feature vector corresponding to each second sample speech frame in the second training sample set.
In implementation, the processing procedure of step 704A and step 704B is similar to the processing procedure of step 203, and is not described herein again.
Step 705A, inputting the acoustic feature vector of each first sample speech frame corresponding to the first sample speech data into the identity verification multitask model to be trained, and outputting a second posterior probability set corresponding to the first sample speech data.
And the second posterior probability set comprises posterior probabilities corresponding to the user identifications of the samples.
In implementation, after obtaining the acoustic feature vector of each first sample speech frame corresponding to the first sample speech data, the server may input the acoustic feature vector of each first sample speech frame corresponding to the first sample speech data to the identity verification multitask model. The identity verification multitask model can output a second posterior probability set corresponding to the first sample voice data. And the second posterior probability set comprises posterior probabilities corresponding to the user identifications of the samples.
Step 705B, inputting the acoustic feature vector corresponding to the second sample speech frame to the identity verification multitask model to be trained, and outputting a third posterior probability set corresponding to the second sample speech frame.
And the third posterior probability set comprises posterior probabilities corresponding to the sample pronunciation units.
In implementation, for each second sample speech frame, after obtaining the acoustic feature vector corresponding to the second sample speech frame, the server may input the acoustic feature vector corresponding to the second sample speech frame to the identity verification multitask model. The identity verification multitask model can output a third posterior probability set corresponding to the second sample voice frame. And the third posterior probability set comprises posterior probabilities corresponding to the sample pronunciation units.
Step 706A, determining a first cost function corresponding to the first training sample set according to the posterior probability of the sample user identifier corresponding to each first sample voice data.
In an implementation, after the server obtains the second posterior probability sets corresponding to the first sample voice data, for each first sample voice data, the server may determine the posterior probability of the sample user identifier corresponding to the first sample voice data in the second posterior probability set corresponding to the first sample voice data. Then, the server may determine a first cost function corresponding to the first training sample set according to the posterior probability of the sample user identifier corresponding to each first sample voice data.
Step 706B, according to the posterior probability of the sample pronunciation unit corresponding to each second sample speech frame, determining a second cost function corresponding to the second training sample set.
In implementation, after the server obtains the third posterior probability set corresponding to each second sample speech frame, for each second sample speech frame, the server may determine the posterior probability of the sample pronunciation unit corresponding to the second sample speech frame in the third posterior probability set corresponding to the second sample speech frame. Then, the server may determine a second cost function corresponding to the second training sample set according to the posterior probability of the sample pronunciation unit corresponding to each second sample speech frame.
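The embodiment does not fix the form of either cost function. A common choice consistent with "determined from the posterior probability of the correct label" is the average negative log posterior (cross-entropy), sketched here under that assumption:

```python
import numpy as np


def first_cost(second_posterior_sets, true_speaker_indices):
    """First cost function over the first training sample set: average negative
    log posterior of the sample user identifier for each first sample voice
    data. `second_posterior_sets` is an (utterances x speakers) array."""
    probs = second_posterior_sets[np.arange(len(true_speaker_indices)),
                                  true_speaker_indices]
    return -np.mean(np.log(probs + 1e-12))


def second_cost(third_posterior_sets, true_unit_indices):
    """Second cost function over the second training sample set: average
    negative log posterior of the sample pronunciation unit for each second
    sample speech frame. `third_posterior_sets` is (frames x units)."""
    probs = third_posterior_sets[np.arange(len(true_unit_indices)),
                                 true_unit_indices]
    return -np.mean(np.log(probs + 1e-12))
```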
And 707A, updating a parameter corresponding to the multitask sharing hidden layer, a parameter corresponding to the voiceprint recognition network and a parameter corresponding to the voice recognition network in the identity verification multitask model to be trained according to the first cost function and a preset first parameter updating algorithm.
In an implementation, the server may store the first parameter updating algorithm in advance. The first parameter updating algorithm may be a stochastic gradient descent method. After obtaining the first cost function, the server can update the parameters corresponding to the multitask shared hidden layer, the parameters corresponding to the voiceprint recognition network, and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the first cost function and the first parameter updating algorithm.
And step 707B, updating the parameters corresponding to the multitask sharing hidden layer and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the second cost function and a preset second parameter updating algorithm.
In an implementation, the server may store the second parameter updating algorithm in advance. The second parameter updating algorithm may be a stochastic gradient descent method. After obtaining the second cost function, the server can update the parameters corresponding to the multitask shared hidden layer and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the second cost function and the second parameter updating algorithm.
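A minimal PyTorch sketch of the two update rules, under the assumption that the three sub-networks are separate modules. Note the asymmetry the text describes: the first cost updates all three parameter groups (the speaker objective reaches the speech recognition network because the voiceprint network consumes its pronunciation features), while the second cost leaves the voiceprint network untouched. Layer sizes and the learning rate are illustrative:

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the three sub-networks; the real layer structures
# are not fixed at this level of the embodiment.
shared = nn.Linear(40, 256)          # multitask shared hidden layer
asr_net = nn.Linear(256, 100)        # voice recognition network (100 units, illustrative)
vpr_net = nn.Linear(256 + 100, 128)  # voiceprint recognition network

# First parameter updating algorithm: SGD over the shared hidden layer,
# the voiceprint network, and the speech network (step 707A).
opt_first = torch.optim.SGD(
    list(shared.parameters()) + list(vpr_net.parameters()) + list(asr_net.parameters()),
    lr=0.01)

# Second parameter updating algorithm: SGD over the shared hidden layer
# and the speech recognition network only (step 707B).
opt_second = torch.optim.SGD(
    list(shared.parameters()) + list(asr_net.parameters()), lr=0.01)


def update_first(first_cost_value):
    """Step 707A: update shared layer, voiceprint network, and speech network."""
    opt_first.zero_grad()
    first_cost_value.backward()
    opt_first.step()


def update_second(second_cost_value):
    """Step 707B: update shared layer and speech recognition network only."""
    opt_second.zero_grad()
    second_cost_value.backward()
    opt_second.step()
```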
The embodiment of the present application further provides a verification method for the identity verification multitask model. As shown in fig. 8, the specific processing procedure is as follows:
step 801, a plurality of pre-stored verification sample sets are obtained.
Wherein each verification sample set comprises a plurality of sample user identifications and second sample voice data corresponding to each sample user identification.
In implementation, a plurality of verification sample sets may be stored in the server in advance. Each verification sample set comprises a plurality of sample user identifications and second sample voice data corresponding to each sample user identification, and the plurality of sample user identifications are the sample user identifications in the first training sample set. When the server needs to verify the identity verification multitask model to be verified, the server can obtain the plurality of pre-stored verification sample sets.
Step 802, for each second sample voice data in each verification sample set, dividing the second sample voice data into at least one third sample voice frame according to a preset segmentation algorithm.
In the implementation, the processing procedure of step 802 is similar to the processing procedure of step 703, and is not described herein again.
Step 803, for each third sample speech frame corresponding to the second sample speech data, extracting an acoustic feature vector corresponding to the third sample speech frame according to a preset acoustic feature extraction algorithm.
In the implementation, the processing procedure of step 803 is similar to the processing procedure of step 704A and step 704B, and is not described herein again.
Step 804, inputting the acoustic feature vector of each third sample speech frame corresponding to the second sample speech data into the identity verification multitask model to be verified, and outputting a fourth posterior probability set corresponding to the second sample speech data.
And the fourth posterior probability set comprises posterior probabilities corresponding to the user identifications of all samples.
In practice, the processing procedure of step 804 is similar to that of step 705A, and is not described herein again.
Step 805, if the posterior probability of the sample user identifier corresponding to the second sample voice data is the maximum value in the fourth posterior probability set corresponding to the second sample voice data, determining that the second sample voice data is the target sample voice data.
In implementation, after obtaining the fourth posterior probability set corresponding to the second sample voice data, the server may determine whether the posterior probability of the sample user identifier corresponding to the second sample voice data is the maximum value in the fourth posterior probability set. If the posterior probability of the sample user identifier corresponding to the second sample voice data is the maximum value, the server may determine the second sample voice data as the target sample voice data.
Step 806, determining a ratio of the number of target sample voice data in the verification sample set to the total number of second sample voice data in the verification sample set as the accuracy of the verification sample set.
In implementation, after the server determines each target sample voice data in the verification sample set for each verification sample set, the server may further calculate a ratio of the number of the target sample voice data in the verification sample set to the total number of the second sample voice data in the verification sample set, so as to obtain the accuracy of the verification sample set.
Step 807, determining the change rate of the accuracy corresponding to each verification sample set according to the accuracy of each verification sample set, and if the change rates of the accuracy corresponding to a consecutive preset number of verification sample sets are all less than or equal to a preset change rate threshold, determining that the training of the identity verification multitask model to be verified is finished.
In implementation, after the server obtains the accuracy of each verification sample set, the server may calculate the change rate of the accuracy corresponding to each verification sample set. If the change rates of a consecutive preset number of verification sample sets are all less than or equal to the preset change rate threshold, the server can determine that the identity verification multitask model to be verified has been trained completely.
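Steps 801 to 807 amount to an accuracy-plateau stopping rule. A compact sketch, interpreting the change rate as the absolute accuracy difference between successive verification sample sets, with the threshold and the consecutive count as assumed hyperparameters:

```python
def training_converged(accuracies, change_rate_threshold=0.001, consecutive=3):
    """Return True if the change rate of accuracy between successive
    verification sample sets stays at or below the threshold for a preset
    number of consecutive sets (steps 806-807)."""
    if len(accuracies) < consecutive + 1:
        return False
    recent = accuracies[-(consecutive + 1):]
    rates = [abs(recent[i + 1] - recent[i]) for i in range(consecutive)]
    return all(r <= change_rate_threshold for r in rates)


# Example: accuracy per verification sample set, i.e. the ratio of target
# sample voice data to total second sample voice data (step 806).
print(training_converged([0.71, 0.80, 0.853, 0.8531, 0.8532, 0.8532]))  # True
```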
An embodiment of the present application further provides an identity authentication apparatus. As shown in fig. 9, the apparatus includes:
a first obtaining module 910, configured to obtain voice data input by a target user according to a target dynamic verification code;
a first dividing module 920, configured to divide voice data into at least one voice frame according to a preset segmentation algorithm;
a first extraction module 930, configured to, for each speech frame, extract an acoustic feature vector corresponding to the speech frame according to a preset acoustic feature extraction algorithm;
a first output module 940, configured to input the acoustic feature vector corresponding to the speech frame into a pre-trained identity verification multitask model, and output an intermediate user feature vector and a first posterior probability set corresponding to the speech frame, where the first posterior probability set includes posterior probabilities corresponding to the preset pronunciation units;
a first determining module 950, configured to determine a first user feature vector corresponding to a target user according to the intermediate user feature vector corresponding to each speech frame and a preset pooling algorithm;
the verifying module 960 is configured to perform identity verification on the target user according to the first user feature vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
As an optional implementation manner, the identity verification multitask model comprises a multitask shared hidden layer, a voiceprint recognition network and a voice recognition network;
the first output module 940 is specifically configured to:
inputting the acoustic characteristic vector corresponding to the voice frame to a multitask shared hidden layer, and outputting an intermediate characteristic vector corresponding to the voice frame;
inputting the intermediate characteristic vector corresponding to the voice frame into a voice recognition network, and outputting the pronunciation characteristic vector corresponding to the voice frame and a first posterior probability set;
and inputting the intermediate characteristic vector and the pronunciation characteristic vector corresponding to the voice frame into a voiceprint recognition network, and outputting the intermediate user characteristic vector corresponding to the voice frame.
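The data flow through these three components can be sketched as follows. The patent fixes only the topology (a shared hidden layer feeding both task networks, with the pronunciation feature vector also routed into the voiceprint network), so the layer types and sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn


class IdentityVerificationMultitaskModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, pron_dim=64, n_units=100, user_dim=128):
        super().__init__()
        # Multitask shared hidden layer.
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Voice recognition network: pronunciation features + posterior head.
        self.pron = nn.Linear(hidden, pron_dim)
        self.unit_out = nn.Linear(pron_dim, n_units)
        # Voiceprint recognition network: consumes intermediate + pronunciation features.
        self.vpr = nn.Linear(hidden + pron_dim, user_dim)

    def forward(self, acoustic_feat):
        inter = self.shared(acoustic_feat)                        # intermediate feature vector
        pron_feat = torch.tanh(self.pron(inter))                  # pronunciation feature vector
        posteriors = torch.softmax(self.unit_out(pron_feat), -1)  # first posterior probability set
        user_vec = self.vpr(torch.cat([inter, pron_feat], -1))    # intermediate user feature vector
        return user_vec, posteriors
```

Per frame, `user_vec` is what the pooling step aggregates into the first user feature vector, while `posteriors` feeds the verification-code scoring described below.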
As an optional implementation, the verification module 960 is specifically configured to:
determining a target dynamic verification code score corresponding to a target user according to a first posterior probability set corresponding to each voice frame;
if the similarity between the first user characteristic vector and a second user characteristic vector corresponding to a pre-stored target user is greater than or equal to a preset similarity threshold, and the target dynamic verification code score is greater than or equal to a preset dynamic verification code score threshold, determining that the target user is a legal user;
and if the similarity between the first user characteristic vector and the second user characteristic vector is smaller than the preset similarity threshold, or the target dynamic verification code score is smaller than the preset dynamic verification code score threshold, determining that the target user is an illegal user.
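A sketch of this two-condition decision, using cosine similarity as the similarity measure (the embodiment requires only "a similarity"; cosine is one common choice, and both thresholds below are placeholders):

```python
import numpy as np


def verify_identity(first_vec, stored_second_vec, code_score,
                    sim_threshold=0.7, score_threshold=-5.0):
    """Legal user iff similarity >= preset similarity threshold AND target
    dynamic verification code score >= preset score threshold.
    Threshold values here are illustrative placeholders."""
    sim = np.dot(first_vec, stored_second_vec) / (
        np.linalg.norm(first_vec) * np.linalg.norm(stored_second_vec) + 1e-12)
    return sim >= sim_threshold and code_score >= score_threshold
```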
As an optional implementation, the verification module 960 is specifically configured to:
acquiring a pronunciation unit sequence corresponding to the target dynamic verification code;
determining a target pronunciation unit corresponding to each voice frame according to a first posterior probability set, a pronunciation unit sequence and a preset forced alignment algorithm corresponding to each voice frame;
aiming at each voice frame, determining the posterior probability of a target pronunciation unit corresponding to the voice frame in a first posterior probability set corresponding to the voice frame, and determining the product of the posterior probability of the target pronunciation unit and the pre-stored prior probability of the target pronunciation unit as the likelihood value of the target pronunciation unit;
and determining the target dynamic verification code score corresponding to the target user according to the likelihood value of the target pronunciation unit corresponding to each voice frame.
As an optional implementation, the verification module 960 is specifically configured to:
determining a word set corresponding to the target dynamic verification code according to the target dynamic verification code and a preset word segmentation algorithm;
aiming at each word in the word set, determining a pronunciation unit sequence corresponding to the word according to a pre-stored correspondence between the word and the pronunciation unit sequence;
and sequencing the pronunciation unit sequence corresponding to each word according to the sequence of each word in the target dynamic verification code to obtain the pronunciation unit sequence corresponding to the target dynamic verification code.
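For example, with the word segmentation routine and the stored lexicon treated as given inputs (both are assumptions of this sketch):

```python
def code_to_pronunciation_units(target_code, segment_words, lexicon):
    """Build the pronunciation unit sequence for the target dynamic
    verification code: segment it into words, look each word up in the
    pre-stored word -> pronunciation-unit-sequence correspondence, and
    concatenate the per-word sequences in the order the words appear."""
    units = []
    for word in segment_words(target_code):  # preset word segmentation algorithm
        units.extend(lexicon[word])          # pre-stored correspondence
    return units


# Illustrative usage with a toy lexicon for the digit string "35":
toy_lexicon = {"3": ["s", "an1"], "5": ["w", "u3"]}
print(code_to_pronunciation_units("35", list, toy_lexicon))  # ['s', 'an1', 'w', 'u3']
```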
As an optional implementation, the verification module 960 is specifically configured to:
for each voice frame, determining the difference between the likelihood value of the target pronunciation unit corresponding to the voice frame and the maximum of the likelihood values of the preset pronunciation units corresponding to the voice frame as the dynamic verification code score corresponding to the voice frame;
and determining the average value of the dynamic verification code scores corresponding to the voice frames as the target dynamic verification code score corresponding to the target user.
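Putting the scoring pieces together, with the per-frame target pronunciation units taken as already produced by the preset forced alignment algorithm: each frame's likelihoods are the posteriors multiplied by the pre-stored priors, the frame score is the target unit's likelihood minus the maximum likelihood over all preset units (so it is at most zero, and exactly zero when the target unit is the best match), and the target dynamic verification code score is the average over frames:

```python
import numpy as np


def dynamic_code_score(posterior_sets, target_units, priors):
    """`posterior_sets`: (frames x units) first posterior probability sets;
    `target_units`: per-frame target pronunciation unit indices from the
    preset forced alignment algorithm (assumed computed elsewhere);
    `priors`: pre-stored prior probability of each pronunciation unit."""
    likelihoods = posterior_sets * priors  # likelihood per frame and unit
    target_lik = likelihoods[np.arange(len(target_units)), target_units]
    frame_scores = target_lik - likelihoods.max(axis=1)  # <= 0 per frame
    return frame_scores.mean()  # target dynamic verification code score
```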
As an optional implementation, the apparatus further comprises:
the second acquisition module is used for acquiring a pre-stored first training sample set, and the first training sample set comprises a plurality of sample user identifications and first sample voice data corresponding to each sample user identification;
the second division module is used for dividing each first sample voice data in the first training sample set into at least one first sample voice frame according to a preset segmentation algorithm;
a second extraction module, configured to extract, according to a preset acoustic feature extraction algorithm, an acoustic feature vector corresponding to each first sample speech frame corresponding to the first sample speech data;
the second output module is used for inputting the acoustic feature vectors of all the first sample voice frames corresponding to the first sample voice data into the identity verification multitask model to be trained and outputting a second posterior probability set corresponding to the first sample voice data, wherein the second posterior probability set comprises posterior probabilities corresponding to all the sample user identifications;
the second determining module is used for determining a first cost function corresponding to the first training sample set according to the posterior probability of the sample user identifier corresponding to each first sample voice data;
and the first updating module is used for updating the parameters corresponding to the multitask sharing hidden layer, the parameters corresponding to the voiceprint recognition network and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the first cost function and a preset first parameter updating algorithm.
As an optional implementation, the apparatus further comprises:
the third acquisition module is used for acquiring a pre-stored second training sample set, wherein the second training sample set comprises a plurality of second sample voice frames and a sample pronunciation unit corresponding to each second sample voice frame;
the third extraction module is used for extracting an acoustic feature vector corresponding to each second sample speech frame in the second training sample set according to a preset acoustic feature extraction algorithm;
a third output module, configured to input the acoustic feature vector corresponding to the second sample speech frame to the identity verification multitask model to be trained, and output a third posterior probability set corresponding to the second sample speech frame, where the third posterior probability set includes posterior probabilities corresponding to the sample pronunciation units;
the third determining module is used for determining a second cost function corresponding to the second training sample set according to the posterior probability of the sample pronunciation unit corresponding to each second sample voice frame;
and the second updating module is used for updating the parameters corresponding to the multitask sharing hidden layer and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the second cost function and a preset second parameter updating algorithm.
As an optional implementation, the apparatus further comprises:
the fourth obtaining module is used for obtaining a plurality of pre-stored verification sample sets, and each verification sample set comprises a plurality of sample user identifications and second sample voice data corresponding to each sample user identification;
the third dividing module is used for dividing each second sample voice data in each verification sample set into at least one third sample voice frame according to a preset segmentation algorithm;
a fourth extraction module, configured to extract, according to a preset acoustic feature extraction algorithm, an acoustic feature vector corresponding to each third sample speech frame corresponding to the second sample speech data;
a fourth output module, configured to input the acoustic feature vector of each third sample speech frame corresponding to the second sample speech data into the identity verification multitask model to be verified, and output a fourth posterior probability set corresponding to the second sample speech data, where the fourth posterior probability set includes posterior probabilities corresponding to user identifiers of each sample;
a fourth determining module, configured to determine that the second sample voice data is target sample voice data if a posterior probability of the sample user identifier corresponding to the second sample voice data is a maximum value in a fourth posterior probability set corresponding to the second sample voice data;
a fifth determining module, configured to determine a ratio of the number of target sample voice data in the verification sample set to the total number of second sample voice data in the verification sample set, as an accuracy of the verification sample set;
and the sixth determining module is used for determining the change rate of the accuracy corresponding to each verification sample set according to the accuracy of each verification sample set, and if the change rates of the accuracy corresponding to a consecutive preset number of verification sample sets are all less than or equal to a preset change rate threshold, determining that the training of the identity verification multitask model to be verified is completed.
The embodiment of the application provides an identity authentication apparatus. The server acquires voice data input by a target user according to the target dynamic verification code, and divides the voice data into at least one voice frame according to a preset segmentation algorithm. Then, for each voice frame, the server extracts the acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm. Next, the server inputs the acoustic feature vector corresponding to the voice frame into a pre-trained identity verification multitask model, which outputs the intermediate user feature vector and the first posterior probability set corresponding to the voice frame; the first posterior probability set comprises the posterior probability corresponding to each preset pronunciation unit. Finally, the server determines the first user feature vector corresponding to the target user according to the intermediate user feature vector corresponding to each voice frame and a preset pooling algorithm, and performs identity verification on the target user according to the first user feature vector and the first posterior probability set corresponding to each voice frame. In this way, the server can process the voice data of the user by deploying only one identity verification multitask model, instead of two models (a voiceprint recognition model and a voice recognition model) with different structures and parameters, which reduces the computational complexity of the server and improves the processing efficiency of the server.
In one embodiment, a computer device is provided, as shown in fig. 10, which includes a memory and a processor. The memory stores a computer program executable on the processor, and the processor, when executing the computer program, implements the steps of any of the identity verification methods described above.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of any of the identity verification methods described above are implemented.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of identity verification, the method comprising:
acquiring voice data input by a target user according to a target dynamic verification code;
dividing the voice data into at least one voice frame according to a preset segmentation algorithm;
aiming at each voice frame, extracting an acoustic feature vector corresponding to the voice frame according to a preset acoustic feature extraction algorithm;
inputting the acoustic characteristic vector corresponding to the voice frame into a multi-task shared hidden layer in an identity verification multi-task model, and outputting an intermediate characteristic vector corresponding to the voice frame;
inputting the intermediate characteristic vector corresponding to the voice frame into a voice recognition network in the identity verification multitask model, and outputting a pronunciation characteristic vector corresponding to the voice frame and a first posterior probability set, wherein the first posterior probability set comprises posterior probabilities corresponding to all preset pronunciation units;
inputting the intermediate characteristic vector and the pronunciation characteristic vector corresponding to the voice frame into a voiceprint recognition network in the identity verification multitask model, and outputting the intermediate user characteristic vector corresponding to the voice frame;
determining a first user characteristic vector corresponding to the target user according to the intermediate user characteristic vector corresponding to each voice frame and a preset pooling algorithm;
and according to the first user characteristic vector corresponding to the target user and the first posterior probability set corresponding to each voice frame, performing identity verification on the target user.
2. The method according to claim 1, wherein the authenticating the target user according to the first user feature vector corresponding to the target user and the first a posteriori probability set corresponding to each speech frame comprises:
determining a target dynamic verification code score corresponding to the target user according to the first posterior probability set corresponding to each voice frame;
if the similarity between the first user characteristic vector and a second user characteristic vector corresponding to the target user and stored in advance is greater than or equal to a preset similarity threshold, and the target dynamic verification code score is greater than or equal to a preset dynamic verification code score threshold, determining that the target user is a legal user;
and if the similarity between the first user characteristic vector and the second user characteristic vector is smaller than the preset similarity threshold, or the target dynamic verification code score is smaller than the preset dynamic verification code score threshold, determining that the target user is an illegal user.
3. The method according to claim 2, wherein the determining the target dynamic verification code score corresponding to the target user according to the first a posteriori probability set corresponding to each speech frame comprises:
acquiring a pronunciation unit sequence corresponding to the target dynamic verification code;
determining a target pronunciation unit corresponding to each voice frame according to the first posterior probability set corresponding to each voice frame, the pronunciation unit sequence and a preset forced alignment algorithm;
aiming at each voice frame, determining the posterior probability of a target pronunciation unit corresponding to the voice frame in a first posterior probability set corresponding to the voice frame, and determining the product of the posterior probability of the target pronunciation unit and the pre-stored prior probability of the target pronunciation unit as the likelihood value of the target pronunciation unit;
and determining the target dynamic verification code score corresponding to the target user according to the likelihood value of the target pronunciation unit corresponding to each voice frame.
4. The method according to claim 3, wherein the obtaining of the pronunciation unit sequence corresponding to the target dynamic verification code comprises:
determining a word set corresponding to the target dynamic verification code according to the target dynamic verification code and a preset word segmentation algorithm;
aiming at each word in the word set, determining a pronunciation unit sequence corresponding to the word according to a pre-stored correspondence between the word and the pronunciation unit sequence;
and sequencing the pronunciation unit sequence corresponding to each word according to the sequence of each word in the target dynamic verification code to obtain the pronunciation unit sequence corresponding to the target dynamic verification code.
5. The method of claim 3, wherein the determining the target dynamic verification code score corresponding to the target user according to the likelihood of the target pronunciation unit corresponding to each speech frame comprises:
for each voice frame, determining the difference between the likelihood value of the target pronunciation unit corresponding to the voice frame and the maximum of the likelihood values of the preset pronunciation units corresponding to the voice frame as the dynamic verification code score corresponding to the voice frame;
and determining the average value of the dynamic verification code scores corresponding to the voice frames as the target dynamic verification code score corresponding to the target user.
6. The method of claim 1, further comprising:
acquiring a pre-stored first training sample set, wherein the first training sample set comprises a plurality of sample user identifications and first sample voice data corresponding to each sample user identification;
for each first sample voice data in the first training sample set, dividing the first sample voice data into at least one first sample voice frame according to a preset segmentation algorithm;
extracting acoustic feature vectors corresponding to the first sample speech frames according to a preset acoustic feature extraction algorithm aiming at each first sample speech frame corresponding to the first sample speech data;
inputting the acoustic feature vector of each first sample voice frame corresponding to the first sample voice data into an identity verification multitask model to be trained, and outputting a second posterior probability set corresponding to the first sample voice data, wherein the second posterior probability set comprises posterior probabilities corresponding to user identifiers of the samples;
determining a first cost function corresponding to the first training sample set according to the posterior probability of the sample user identification corresponding to each first sample voice data;
and updating the parameters corresponding to the multitask sharing hidden layer, the parameters corresponding to the voiceprint recognition network and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the first cost function and a preset first parameter updating algorithm.
7. The method of claim 6, further comprising:
acquiring a pre-stored second training sample set, wherein the second training sample set comprises a plurality of second sample voice frames and a sample pronunciation unit corresponding to each second sample voice frame;
extracting an acoustic feature vector corresponding to each second sample speech frame in the second training sample set according to the preset acoustic feature extraction algorithm;
inputting the acoustic feature vector corresponding to the second sample voice frame into an identity verification multitask model to be trained, and outputting a third posterior probability set corresponding to the second sample voice frame, wherein the third posterior probability set comprises posterior probabilities corresponding to all sample pronunciation units;
determining a second cost function corresponding to the second training sample set according to the posterior probability of the sample pronunciation unit corresponding to each second sample voice frame;
and updating the parameters corresponding to the multitask sharing hidden layer and the parameters corresponding to the voice recognition network in the identity verification multitask model to be trained according to the second cost function and a preset second parameter updating algorithm.
8. The method of claim 6, further comprising:
obtaining a plurality of pre-stored verification sample sets, wherein each verification sample set comprises a plurality of sample user identifications and second sample voice data corresponding to each sample user identification;
for each second sample voice data in each verification sample set, dividing the second sample voice data into at least one third sample voice frame according to a preset segmentation algorithm;
extracting an acoustic feature vector corresponding to each third sample speech frame corresponding to the second sample speech data according to a preset acoustic feature extraction algorithm;
inputting the acoustic feature vector of each third sample voice frame corresponding to the second sample voice data into an identity verification multitask model to be verified, and outputting a fourth posterior probability set corresponding to the second sample voice data, wherein the fourth posterior probability set comprises posterior probabilities corresponding to user identifiers of the samples;
if the posterior probability of the sample user identifier corresponding to the second sample voice data is the maximum value in the fourth posterior probability set corresponding to the second sample voice data, determining the second sample voice data as the target sample voice data;
determining the ratio of the number of target sample voice data in the verification sample set to the total number of second sample voice data in the verification sample set as the accuracy of the verification sample set;
and determining the change rate of the accuracy corresponding to each verification sample set according to the accuracy of each verification sample set, and if the change rates of the accuracy corresponding to a consecutive preset number of verification sample sets are all less than or equal to a preset change rate threshold, determining that the identity verification multitask model to be verified has been trained.
9. An apparatus for identity verification, the apparatus comprising:
the first acquisition module is used for acquiring voice data input by a target user according to the target dynamic verification code;
the first division module is used for dividing the voice data into at least one voice frame according to a preset segmentation algorithm;
the first extraction module is used for extracting an acoustic feature vector corresponding to each voice frame according to a preset acoustic feature extraction algorithm;
a first output module, configured to: input the acoustic feature vector corresponding to the voice frame into a multitask shared hidden layer in an identity verification multitask model and output an intermediate feature vector corresponding to the voice frame; input the intermediate feature vector corresponding to the voice frame into a voice recognition network in the identity verification multitask model and output a pronunciation feature vector corresponding to the voice frame and a first posterior probability set, wherein the first posterior probability set comprises posterior probabilities corresponding to each preset pronunciation unit; and input the intermediate feature vector and the pronunciation feature vector corresponding to the voice frame into a voiceprint recognition network in the identity verification multitask model and output an intermediate user feature vector corresponding to the voice frame;
the first determining module is used for determining a first user characteristic vector corresponding to the target user according to the intermediate user characteristic vector corresponding to each voice frame and a preset pooling algorithm;
and the verification module is used for verifying the identity of the target user according to the first user characteristic vector corresponding to the target user and the first posterior probability set corresponding to each voice frame.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
CN201910711306.0A 2019-08-02 2019-08-02 Identity authentication method and device, computer equipment and storage medium Active CN110379433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910711306.0A CN110379433B (en) 2019-08-02 2019-08-02 Identity authentication method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110379433A CN110379433A (en) 2019-10-25
CN110379433B true CN110379433B (en) 2021-10-08

Family

ID=68257916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910711306.0A Active CN110379433B (en) 2019-08-02 2019-08-02 Identity authentication method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110379433B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment
CN111599382B (en) * 2020-07-27 2020-10-27 深圳市声扬科技有限公司 Voice analysis method, device, computer equipment and storage medium
CN112927687A (en) * 2021-01-25 2021-06-08 珠海格力电器股份有限公司 Method, device and system for controlling functions of equipment and storage medium
CN113178197B (en) * 2021-04-27 2024-01-09 平安科技(深圳)有限公司 Training method and device of voice verification model and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424419A (en) * 2013-08-30 2015-03-18 鸿富锦精密工业(武汉)有限公司 Encrypting and decrypting method and system based on voiceprint recognition technology
CN107481736A (en) * 2017-08-14 2017-12-15 广东工业大学 A kind of vocal print identification authentication system and its certification and optimization method and system
CN108140386A (en) * 2016-07-15 2018-06-08 谷歌有限责任公司 Speaker verification
CN109428719A (en) * 2017-08-22 2019-03-05 阿里巴巴集团控股有限公司 A kind of auth method, device and equipment
US20190122669A1 (en) * 2016-06-01 2019-04-25 Baidu Online Network Technology (Beijing) Co., Ltd. Methods and devices for registering voiceprint and for authenticating voiceprint

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539616B2 (en) * 2006-02-20 2009-05-26 Microsoft Corporation Speaker authentication using adapted background models
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
CN104834849B (en) * 2015-04-14 2018-09-18 北京远鉴科技有限公司 Dual-factor identity authentication method and system based on Application on Voiceprint Recognition and recognition of face
US10916254B2 (en) * 2016-08-22 2021-02-09 Telefonaktiebolaget Lm Ericsson (Publ) Systems, apparatuses, and methods for speaker verification using artificial neural networks
CN106971713B (en) * 2017-01-18 2020-01-07 北京华控智加科技有限公司 Speaker marking method and system based on density peak value clustering and variational Bayes
CN107104803B (en) * 2017-03-31 2020-01-07 北京华控智加科技有限公司 User identity authentication method based on digital password and voiceprint joint confirmation
WO2018209608A1 (en) * 2017-05-17 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. Method and system for robust language identification
US10347241B1 (en) * 2018-03-23 2019-07-09 Microsoft Technology Licensing, Llc Speaker-invariant training via adversarial learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant