CN108694952B - Electronic device, identity authentication method and storage medium - Google Patents

Electronic device, identity authentication method and storage medium

Info

Publication number
CN108694952B
CN108694952B · CN201810311721.2A
Authority
CN
China
Prior art keywords
user
voice
read
preset
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810311721.2A
Other languages
Chinese (zh)
Other versions
CN108694952A (en
Inventor
王健宗
于夕畔
李瑾瑾
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810311721.2A priority Critical patent/CN108694952B/en
Priority to PCT/CN2018/102208 priority patent/WO2019196305A1/en
Publication of CN108694952A publication Critical patent/CN108694952A/en
Application granted granted Critical
Publication of CN108694952B publication Critical patent/CN108694952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/08 - Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L17/16 - Hidden Markov models [HMM]
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L17/24 - Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Abstract

The invention relates to an electronic device, an identity authentication method and a storage medium, wherein the method comprises the following steps: when a user transacts business in an IVR scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and, after the user has read it, establishing an acoustic model of a preset type for the broadcast random code and for the voice the user read aloud, respectively; performing a forced integral alignment operation on the acoustic model of the broadcast random code and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same; if the probability is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud and the standard voiceprint feature vector so as to verify the user's identity. The invention performs double verification of the user's identity and can confirm that identity accurately.

Description

Electronic device, identity authentication method and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an electronic device, an identity authentication method, and a storage medium.
Background
Currently, in an Interactive Voice Response (IVR) scenario, schemes exist that verify a client's identity by combining IVR with voiceprint recognition, for example when the client uses a phone to activate a credit card or modify a password after receiving the card. In the prior art, because the two parties to a remote voiceprint verification are not at the same location, a client may attempt fraud by playing back a synthesized voice prepared in advance; the client's identity therefore cannot be confirmed accurately, and the security of the authentication is low.
Disclosure of Invention
The invention aims to provide an electronic device, an identity authentication method and a storage medium that perform double verification of a user's identity so that the identity can be confirmed accurately.
In order to achieve the above object, the present invention provides an electronic device, which includes a memory and a processor connected to the memory, wherein the memory stores a processing system capable of running on the processor, and when executed by the processor, the processing system implements the following steps:
an acoustic model establishing step: when a user transacts business in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and, after the user has read it, establishing an acoustic model of a preset type for the broadcast random code and for the voice the user read aloud, respectively;
a forced integral alignment step: performing a forced integral alignment operation on the acoustic model of the broadcast random code and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
an identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud this time, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector so as to verify the user's identity.
Preferably, the processing system, when executed by the processor, further implements the steps of:
when a user performs voiceprint registration in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a second preset number of digits for the user to read aloud a preset number of times, and establishing an acoustic model of the preset type for the broadcast random code and for the voice the user reads aloud each time, respectively;
performing a forced integral alignment operation on the acoustic model of the random code broadcast each time and the acoustic model of the voice the user read aloud that time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, extracting the voiceprint feature vector of the voice the user read aloud each time, and calculating the distance between each pair of voiceprint feature vectors so as to analyze whether the readings all came from the same user;
if so, storing the voiceprint feature vector as the standard voiceprint feature vector of the user.
Preferably, the acoustic model of the preset type is a deep neural network-hidden markov model.
Preferably, the step of extracting the voiceprint feature vector of the voice the user read aloud this time includes:
performing pre-emphasis and windowing on the voice the user read aloud this time, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and passing the spectrum through a Mel filter bank to obtain the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and forming the voiceprint feature vector of the voice the user read aloud this time from the Mel-frequency cepstral coefficients.
In order to achieve the above object, the present invention further provides an identity authentication method, where the identity authentication method includes:
S1, when a user transacts business in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and establishing an acoustic model of a preset type for the broadcast random code and for the voice the user reads aloud, respectively;
S2, performing a forced integral alignment operation on the acoustic model of the random code broadcast this time and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
S3, if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud this time, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector so as to verify the user's identity.
Preferably, before the step S1, the method further includes:
S01, when a user performs voiceprint registration in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a second preset number of digits for the user to read aloud a preset number of times, and establishing an acoustic model of the preset type for the broadcast random code and for the voice the user reads aloud each time, respectively;
S02, performing a forced integral alignment operation on the acoustic model of the random code broadcast each time and the acoustic model of the voice the user read aloud that time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
S03, if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, extracting the voiceprint feature vector of the voice the user read aloud each time, and calculating the distance between each pair of voiceprint feature vectors so as to analyze whether the readings all came from the same user;
S04, if so, storing the voiceprint feature vector as the standard voiceprint feature vector of the user.
Preferably, the acoustic model of the preset type is a deep neural network-hidden markov model.
Preferably, the step of extracting the voiceprint feature vector of the voice the user read aloud this time includes:
performing pre-emphasis and windowing on the voice the user read aloud this time, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and passing the spectrum through a Mel filter bank to obtain the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and forming the voiceprint feature vector of the voice the user read aloud this time from the Mel-frequency cepstral coefficients.
Preferably, the step of calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector computes the cosine distance

d(A, B) = 1 − (A · B) / (‖A‖ ‖B‖)

where A is the standard voiceprint feature vector and B is the voiceprint feature vector of the voice the user read aloud this time.
The invention also provides a computer readable storage medium having stored thereon a processing system, which when executed by a processor implements the steps of the method of identity verification described above.
The invention has the following beneficial effects: when identity recognition is performed in an Interactive Voice Response (IVR) scenario, having the user read a random code aloud effectively prevents fraud with a synthesized voice prepared in advance; combining the random code with voiceprint recognition realizes double verification of the user's identity, so the identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing the forced integral alignment operation on the acoustic model of the broadcast random code and the acoustic model of the voice the user read aloud reduces the amount of calculation and improves the efficiency of identity recognition.
Drawings
FIG. 1 is a schematic diagram of an alternative application environment according to various embodiments of the present invention;
fig. 2 is a flowchart illustrating an embodiment of an authentication method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that references to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided the combination can be realized by a person skilled in the art; when technical solutions are contradictory or cannot be realized, their combination should be considered not to exist and falls outside the protection scope of the present invention.
Fig. 1 is a schematic diagram of an application environment of the method for authenticating identity according to the preferred embodiment of the present invention. The application environment diagram includes the electronic device 1 and the terminal equipment. The electronic apparatus 1 may perform data interaction with the terminal device through a suitable technology such as a network, a near field communication technology, and the like. In this embodiment, the user logs in the interactive voice response IVR system of the electronic device 1 through the terminal device to perform voiceprint registration and voiceprint recognition operations.
The terminal device includes, but is not limited to, any electronic product capable of performing man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, for example, a mobile device such as a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive web Television (IPTV), an intelligent wearable device, a navigation device, or the like, or a fixed terminal such as a Digital TV, a desktop computer, a notebook, a server, or the like.
The electronic apparatus 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with a command set in advance or stored. The electronic device 1 may be a computer, or may be a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is one of distributed computing and is a super virtual computer composed of a group of loosely coupled computers.
In the present embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other through a system bus, wherein the memory 11 stores a processing system operable on the processor 12. It is noted that fig. 1 only shows the electronic device 1 with components 11-13, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk provided on the electronic device 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. In this embodiment, the readable storage medium of the memory 11 is generally used for storing the operating system and various types of application software installed in the electronic device 1, for example the program code of the processing system in an embodiment of the present invention. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the electronic apparatus 1, such as performing control and processing related to data interaction or communication with the terminal device. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run a processing system.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing a communication connection between the electronic apparatus 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic apparatus 1 with one or more terminal devices, and establish a data transmission channel and a communication connection between the electronic apparatus 1 and the one or more terminal devices.
The processing system is stored in the memory 11 and includes at least one computer readable instruction stored in the memory 11, which is executable by the processor 12 to implement the method of the embodiments of the present application; and the at least one computer readable instruction may be divided into different logic blocks depending on the functions implemented by the respective portions.
In one embodiment, the processing system described above, when executed by the processor 12, performs the following steps:
an acoustic model establishing step: when a user transacts business in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and, after the user has read it, establishing an acoustic model of a preset type for the broadcast random code and for the voice the user read aloud, respectively;
in an Interactive Voice Response (IVR) scene, when a user requests to handle a service, an identity identification code, such as an identity card number, is sent, after the request of the user is received, whether the service handled by the user needs further identity verification is analyzed, whether the user is registered with a voiceprint is analyzed according to the identity identification code of the user, if the user needs further identity verification and the user is registered with the voiceprint, a random code with a first preset digit is generated, the random code is broadcasted in a voice mode by adopting a voice synthesis technology, the user is guided to follow up reading, and the first preset digit is 8 digits for example.
After the user has read the code aloud, an acoustic model of a preset type is established for the voice of the broadcast random code, and another for the voice the user read aloud. In a preferred embodiment, the acoustic model of the preset type is a deep neural network-hidden Markov acoustic model, i.e. a DNN-HMM acoustic model. In other embodiments, the acoustic model of the preset type may be another acoustic model, such as a hidden Markov acoustic model.
In a specific example, taking the DNN-HMM acoustic model, the HMM describes the dynamic changes of the speech signal, and each output node of the DNN estimates the posterior probability of a state of a continuous-density HMM, yielding the DNN-HMM model. The voice of the random code broadcast this time and the voice the user read aloud are each a series of syllables, which, once recognized as text, correspond to a series of characters. In this embodiment, when the DNN-HMM acoustic model is established, a DNN-HMM acoustic model of the voice of the random code broadcast this time and a DNN-HMM acoustic model of the voice the user read aloud this time are obtained through global character-level acoustic adaptive training based on a predetermined character voice library.
a forced integral alignment step: performing a forced integral alignment operation on the acoustic model of the broadcast random code and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
Compared with the traditional approach of word-by-word comparison, aligning the two models as a whole greatly reduces the amount of calculation and helps improve the efficiency of identity recognition.
In one embodiment, the preset algorithm is a posterior probability algorithm; in other embodiments it may be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the characters in the two aligned acoustic models, where a smaller edit distance means a greater probability that the two aligned acoustic models are the same; it may also be a longest-common-subsequence algorithm, where the smaller the difference between the length of the longest common subsequence and the length of the characters in the two aligned acoustic models, the greater the probability that the two aligned acoustic models are the same.
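To make the two similarity options concrete, here is a hedged Python sketch of both measures over the recognized character sequences. The patent gives no formulas, so the normalization of the edit-distance score into a probability-like value in [0, 1] is an illustrative assumption.

    def edit_distance(a: str, b: str) -> int:
        """Classic Levenshtein distance via dynamic programming."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[len(b)]

    def lcs_length(a: str, b: str) -> int:
        """Length of the longest common subsequence of two strings."""
        dp = [0] * (len(b) + 1)
        for ca in a:
            prev = 0
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], (prev + 1 if ca == cb
                                      else max(dp[j], dp[j - 1]))
        return dp[len(b)]

    def same_probability(broadcast: str, followed: str) -> float:
        # Smaller edit distance means a higher probability that the two
        # sequences match; dividing by the longer length keeps it in [0, 1].
        return 1 - edit_distance(broadcast, followed) / max(len(broadcast),
                                                            len(followed), 1)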
an identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud this time, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector so as to verify the user's identity.
In this embodiment, if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, for example 0.985, the characters read aloud by the user this time are deemed consistent with the random code broadcast this time. Because a fresh random code is broadcast each time, this effectively prevents fraud with a synthesized voice the user prepared in advance and improves the security of identity recognition.
In one embodiment, the step of extracting the voiceprint feature vector of the voice the user read aloud this time includes: performing pre-emphasis and windowing on the voice, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and passing the spectrum through a Mel filter bank to obtain the Mel spectrum; then performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and forming the voiceprint feature vector of the voice the user read aloud this time from those coefficients.
The voice the user read aloud this time is first divided into frames, and the framed voice data is then pre-emphasized. Pre-emphasis is in effect high-pass filtering: it filters out low-frequency data so that the high-frequency characteristics of the voice stand out. Specifically, the transfer function of the high-pass filter is H(z) = 1 − αz⁻¹, where z is the voice data and α is a constant coefficient, preferably α = 0.97. Because framing makes the speech deviate to some extent from the original signal, the voice data also needs to be windowed.
In this embodiment, the cepstral analysis of the Mel spectrum consists, for example, of taking the logarithm and then an inverse transform; the inverse transform is generally realized by a DCT (discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the Mel-frequency cepstral coefficients (MFCC). The MFCC of a frame are the voiceprint features of that frame of voice data; the MFCC of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice the user read aloud this time.
Forming the voiceprint feature vector from the Mel-frequency cepstral coefficients approximates the human auditory system more closely than the linearly spaced frequency bands of the normal log cepstrum, which improves the accuracy of identity verification.
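For orientation, the following is a minimal Python sketch of the extraction pipeline just described (pre-emphasis with α = 0.97, framing, Hamming windowing, FFT, Mel filter bank, log, DCT, coefficients 2 to 13). The 25 ms frame length, 10 ms frame step, 512-point FFT and 26-filter Mel bank are illustrative assumptions not specified in the description.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_features(signal, sr, alpha=0.97, frame_len=0.025,
                      frame_step=0.010, n_fft=512, n_mels=26):
        """Sketch of the MFCC pipeline described above; assumes the
        signal is at least one frame long."""
        # Pre-emphasis: high-pass filter H(z) = 1 - alpha * z^-1
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
        # Framing
        flen, fstep = int(sr * frame_len), int(sr * frame_step)
        n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
        frames = np.stack([emphasized[i * fstep:i * fstep + flen]
                           for i in range(n_frames)])
        # Windowing (Hamming) and magnitude spectrum via FFT
        spectrum = np.abs(np.fft.rfft(frames * np.hamming(flen), n_fft))
        # Triangular Mel filter bank
        def hz2mel(f): return 2595 * np.log10(1 + f / 700.0)
        def mel2hz(m): return 700.0 * (10 ** (m / 2595.0) - 1)
        mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
        bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for m in range(1, n_mels + 1):
            l, c, r = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        mel_spectrum = spectrum @ fbank.T
        # Cepstral analysis: log, then DCT; keep coefficients 2-13 as MFCC
        log_mel = np.log(mel_spectrum + 1e-10)
        return dct(log_mel, type=2, axis=1, norm='ortho')[:, 1:13]

The returned matrix has one row of 12 coefficients per frame, matching the feature data matrix the description assembles from the per-frame MFCC.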
In one embodiment, calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector means calculating the cosine distance between the two:

d(A, B) = 1 − (A · B) / (‖A‖ ‖B‖)

where A is the standard voiceprint feature vector and B is the voiceprint feature vector of the voice the user read aloud this time.
If the cosine distance is smaller than or equal to the preset distance threshold, the identity authentication is passed; and if the cosine distance is greater than the preset distance threshold, the identity authentication is not passed.
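A hedged sketch of this verification decision follows. It assumes the per-frame MFCC matrix has already been pooled (e.g. averaged over frames) into a single vector per utterance, and the 0.1 threshold is purely illustrative, since the description leaves the preset distance threshold open.

    import numpy as np

    def cosine_distance(a, b):
        """Cosine distance between two voiceprint feature vectors:
        1 minus the cosine similarity, so smaller means more alike."""
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def verify_identity(probe_vec, standard_vec, threshold=0.1):
        # Pass when the probe voiceprint is within the preset distance
        # threshold of the stored standard voiceprint.
        return cosine_distance(probe_vec, standard_vec) <= threshold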
In an embodiment, the step of registering the voiceprint includes:
when a user performs voiceprint registration in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a second preset number of digits for the user to read aloud a preset number of times, and establishing an acoustic model of the preset type for the broadcast random code and for the voice the user reads aloud each time, respectively;
performing a forced integral alignment operation on the acoustic model of the random code broadcast each time and the acoustic model of the voice the user read aloud that time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, extracting the voiceprint feature vector of the voice the user read aloud each time, and calculating the distance between each pair of voiceprint feature vectors so as to analyze whether the readings all came from the same user;
if so, storing the voiceprint feature vector as a standard voiceprint feature vector of the user;
if not, prompting the user to re-enter, and performing the step of registering the voiceprint again.
In an Interactive Voice Response (IVR) scenario, a user who requests registration sends an identification code, such as an ID card number. After the user's request is received, a random code with a second preset number of digits is generated, broadcast as voice using speech synthesis technology, and the user is guided to read it aloud a preset number of times (for example, 3 times); the second preset number of digits is, for example, 8.
After the user has read the code aloud, an acoustic model of a preset type is established for the voice of the random code broadcast each time, and another for the voice the user read aloud each time. In a preferred embodiment, the acoustic model of the preset type is a deep neural network-hidden Markov acoustic model, i.e. a DNN-HMM acoustic model; in other embodiments it may be another acoustic model, such as a hidden Markov acoustic model. For specific examples, reference may be made to the embodiments above, which are not repeated here.
In a specific example, taking the DNN-HMM acoustic model, the HMM describes the dynamic changes of the speech signal, and each output node of the DNN estimates the posterior probability of a state of a continuous-density HMM, yielding the DNN-HMM model. The voice of the random code broadcast each time and the voice the user reads aloud are each a series of syllables, which, once recognized as text, correspond to a series of characters. In this embodiment, when the DNN-HMM acoustic model is established, a DNN-HMM acoustic model of the voice of the broadcast random code and a DNN-HMM acoustic model of the voice the user read aloud are obtained through global character-level acoustic adaptive training based on a predetermined character voice library.
Compared with the traditional approach of word-by-word comparison, aligning the two models as a whole greatly reduces the amount of calculation and helps improve the efficiency of identity recognition.
In one embodiment, the preset algorithm is a posterior probability algorithm; in other embodiments it may be a similarity algorithm. For specific examples, reference may be made to the embodiments above, which are not repeated here.
In this embodiment, if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, for example 0.985, the characters read aloud by the user each time are deemed consistent with the broadcast random code. Because a fresh random code is broadcast, this effectively prevents fraud with a synthesized voice the user prepared in advance and improves the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of the voice that the user reads with each time is substantially the same as the method of extracting the voiceprint feature vector of the voice in the above embodiment, and details are not repeated here.
In an embodiment, the step of calculating the distance between two voiceprint feature vectors is substantially the same as the step of calculating the cosine distance, and is not described herein again.
If the cosine distance is smaller than or equal to a preset distance threshold, the users reading at each time are the same user, and the voiceprint characteristic vector is used as a standard voiceprint characteristic vector of the user to be stored; and if the cosine distance is greater than the preset distance threshold, the user who follows reading at each time is not the same user, and the user is prompted to re-register.
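As a sketch of this pairwise consistency check at registration time, the snippet below mirrors the description: enrollment succeeds only when every pair of readings falls within the distance threshold. The threshold value and the use of the mean vector as the stored standard voiceprint are illustrative assumptions.

    from itertools import combinations
    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def all_same_user(voiceprints, threshold=0.1):
        """True when every pair of enrollment voiceprint vectors is within
        the distance threshold, i.e. all readings came from one user."""
        return all(cosine_distance(a, b) <= threshold
                   for a, b in combinations(voiceprints, 2))

    def register(voiceprints, threshold=0.1):
        # Store the standard voiceprint only if all readings agree; taking
        # the mean vector as the stored standard is an illustrative choice.
        if all_same_user(voiceprints, threshold):
            return np.mean(voiceprints, axis=0)
        return None  # prompt the user to re-register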
Compared with the prior art, when identity recognition is performed in the Interactive Voice Response (IVR) scenario, having the user read a random code aloud effectively prevents fraud with a synthesized voice prepared in advance; combining the random code with voiceprint recognition realizes double verification of the user's identity, so the identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing the forced integral alignment operation on the acoustic model of the broadcast random code and the acoustic model of the voice the user read aloud reduces the amount of calculation and improves the efficiency of identity recognition.
As shown in fig. 2, fig. 2 is a schematic flowchart of an embodiment of the method for authenticating identity, where the method for authenticating identity includes the following steps:
Step S1: when a user transacts business in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and establishing an acoustic model of a preset type for the broadcast random code and for the voice the user reads aloud, respectively;
In an Interactive Voice Response (IVR) scenario, a user who requests to transact business sends an identification code, such as an ID card number. After the user's request is received, the system analyzes whether the requested business requires further identity verification and, based on the identification code, whether the user has registered a voiceprint. If further verification is required and the user has a registered voiceprint, a random code with a first preset number of digits is generated, broadcast as voice using speech synthesis technology, and the user is guided to read it aloud; the first preset number of digits is, for example, 8.
After the user has read the code aloud, an acoustic model of a preset type is established for the voice of the broadcast random code, and another for the voice the user read aloud. In a preferred embodiment, the acoustic model of the preset type is a deep neural network-hidden Markov acoustic model, i.e. a DNN-HMM acoustic model. In other embodiments, the acoustic model of the preset type may be another acoustic model, such as a hidden Markov acoustic model.
In a specific example, taking the DNN-HMM acoustic model, the HMM describes the dynamic changes of the speech signal, and each output node of the DNN estimates the posterior probability of a state of a continuous-density HMM, yielding the DNN-HMM model. The voice of the random code broadcast this time and the voice the user read aloud are each a series of syllables, which, once recognized as text, correspond to a series of characters. In this embodiment, when the DNN-HMM acoustic model is established, a DNN-HMM acoustic model of the voice of the random code broadcast this time and a DNN-HMM acoustic model of the voice the user read aloud this time are obtained through global character-level acoustic adaptive training based on a predetermined character voice library.
Step S2: performing a forced integral alignment operation on the acoustic model of the random code broadcast this time and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
Compared with the traditional approach of word-by-word comparison, aligning the two models as a whole greatly reduces the amount of calculation and helps improve the efficiency of identity recognition.
In one embodiment, the preset algorithm is a posterior probability algorithm; in other embodiments it may be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the characters in the two aligned acoustic models, where a smaller edit distance means a greater probability that the two aligned acoustic models are the same; it may also be a longest-common-subsequence algorithm, where the smaller the difference between the length of the longest common subsequence and the length of the characters in the two aligned acoustic models, the greater the probability that the two aligned acoustic models are the same.
Step S3: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud this time, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector so as to verify the user's identity.
In this embodiment, if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, for example 0.985, the characters read aloud by the user this time are deemed consistent with the random code broadcast this time. Because a fresh random code is broadcast each time, this effectively prevents fraud with a synthesized voice the user prepared in advance and improves the security of identity recognition.
In one embodiment, the step of extracting the voiceprint feature vector of the voice the user read aloud this time includes: performing pre-emphasis and windowing on the voice, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and passing the spectrum through a Mel filter bank to obtain the Mel spectrum; then performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and forming the voiceprint feature vector of the voice the user read aloud this time from those coefficients.
The voice the user read aloud this time is first divided into frames, and the framed voice data is then pre-emphasized. Pre-emphasis is in effect high-pass filtering: it filters out low-frequency data so that the high-frequency characteristics of the voice stand out. Specifically, the transfer function of the high-pass filter is H(z) = 1 − αz⁻¹, where z is the voice data and α is a constant coefficient, preferably α = 0.97. Because framing makes the speech deviate to some extent from the original signal, the voice data also needs to be windowed.
In this embodiment, the cepstral analysis of the Mel spectrum consists, for example, of taking the logarithm and then an inverse transform; the inverse transform is generally realized by a DCT (discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the Mel-frequency cepstral coefficients (MFCC). The MFCC of a frame are the voiceprint features of that frame of voice data; the MFCC of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice the user read aloud this time.
Forming the voiceprint feature vector from the Mel-frequency cepstral coefficients approximates the human auditory system more closely than the linearly spaced frequency bands of the normal log cepstrum, which improves the accuracy of identity verification.
In one embodiment, calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector means calculating the cosine distance between the two:

d(A, B) = 1 − (A · B) / (‖A‖ ‖B‖)

where A is the standard voiceprint feature vector and B is the voiceprint feature vector of the voice the user read aloud this time.
If the cosine distance is smaller than or equal to the preset distance threshold, the identity authentication is passed; and if the cosine distance is greater than the preset distance threshold, the identity authentication is not passed.
In an embodiment, the step of registering the voiceprint includes:
when a user performs voiceprint registration in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a second preset number of digits for the user to read aloud a preset number of times, and establishing an acoustic model of the preset type for the broadcast random code and for the voice the user reads aloud each time, respectively;
performing a forced integral alignment operation on the acoustic model of the random code broadcast each time and the acoustic model of the voice the user read aloud that time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, extracting the voiceprint feature vector of the voice the user read aloud each time, and calculating the distance between each pair of voiceprint feature vectors so as to analyze whether the readings all came from the same user;
if so, storing the voiceprint feature vector as a standard voiceprint feature vector of the user;
if not, prompting the user to re-enter, and performing the step of registering the voiceprint again.
In an Interactive Voice Response (IVR) scenario, a user who requests registration sends an identification code, such as an ID card number. After the user's request is received, a random code with a second preset number of digits is generated, broadcast as voice using speech synthesis technology, and the user is guided to read it aloud a preset number of times (for example, 3 times); the second preset number of digits is, for example, 8.
After the user has read the code aloud, an acoustic model of a preset type is established for the voice of the random code broadcast each time, and another for the voice the user read aloud each time. In a preferred embodiment, the acoustic model of the preset type is a deep neural network-hidden Markov acoustic model, i.e. a DNN-HMM acoustic model; in other embodiments it may be another acoustic model, such as a hidden Markov acoustic model. For specific examples, reference may be made to the embodiments above, which are not repeated here.
In a specific example, taking the DNN-HMM acoustic model, the HMM describes the dynamic changes of the speech signal, and each output node of the DNN estimates the posterior probability of a state of a continuous-density HMM, yielding the DNN-HMM model. The voice of the random code broadcast each time and the voice the user reads aloud are each a series of syllables, which, once recognized as text, correspond to a series of characters. In this embodiment, when the DNN-HMM acoustic model is established, a DNN-HMM acoustic model of the voice of the broadcast random code and a DNN-HMM acoustic model of the voice the user read aloud are obtained through global character-level acoustic adaptive training based on a predetermined character voice library.
Compared with the traditional approach of word-by-word comparison, aligning the two models as a whole greatly reduces the amount of calculation and helps improve the efficiency of identity recognition.
In one embodiment, the preset algorithm is a posterior probability algorithm; in other embodiments it may be a similarity algorithm. For specific examples, reference may be made to the embodiments above, which are not repeated here.
In this embodiment, if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, for example 0.985, the characters read aloud by the user each time are deemed consistent with the broadcast random code. Because a fresh random code is broadcast, this effectively prevents fraud with a synthesized voice the user prepared in advance and improves the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of the voice that the user reads with each time is substantially the same as the method of extracting the voiceprint feature vector of the voice in the above embodiment, and details are not repeated here.
In an embodiment, the step of calculating the distance between two voiceprint feature vectors is substantially the same as the step of calculating the cosine distance, and is not described herein again.
If the cosine distance is smaller than or equal to a preset distance threshold, the users reading at each time are the same user, and the voiceprint characteristic vector is used as a standard voiceprint characteristic vector of the user to be stored; and if the cosine distance is greater than the preset distance threshold, the user who follows reading at each time is not the same user, and the user is prompted to re-register.
The invention also provides a computer readable storage medium having stored thereon a processing system, which when executed by a processor implements the steps of the method of identity verification described above.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. An electronic device, comprising a memory and a processor connected to the memory, wherein the memory stores a processing system operable on the processor, and the processing system when executed by the processor implements the following steps:
an acoustic model establishing step: when a user transacts business in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and, after the user has read it, establishing an acoustic model of a preset type for the broadcast random code and for the voice the user read aloud, respectively;
a forced integral alignment step: performing a forced integral alignment operation on the acoustic model of the broadcast random code and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
an identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud this time, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector so as to verify the user's identity;
when executed by the processor, the processing system further implements the steps of:
when a user performs voiceprint registration in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a second preset number of digits for the user to read aloud a preset number of times, and establishing an acoustic model of the preset type for the broadcast random code and for the voice the user reads aloud each time, respectively;
performing a forced integral alignment operation on the acoustic model of the random code broadcast each time and the acoustic model of the voice the user read aloud that time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, extracting the voiceprint feature vector of the voice the user read aloud each time, and calculating the distance between each pair of voiceprint feature vectors so as to analyze whether the readings all came from the same user;
if so, storing the voiceprint feature vector as the standard voiceprint feature vector of the user.
2. The electronic device according to claim 1, wherein the acoustic model of the preset type is a deep neural network-hidden markov model.
3. The electronic device according to claim 1, wherein the step of extracting the voiceprint feature vector of the voice the user read aloud includes:
performing pre-emphasis and windowing on the voice the user read aloud this time, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and passing the spectrum through a Mel filter bank to obtain the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and forming the voiceprint feature vector of the voice the user read aloud this time from the Mel-frequency cepstral coefficients.
4. A method of identity verification, the method of identity verification comprising:
S1, when a user transacts business in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read aloud, and establishing an acoustic model of a preset type for the broadcast random code and for the voice the user reads aloud, respectively;
S2, performing a forced integral alignment operation on the acoustic model of the random code broadcast this time and the acoustic model of the voice the user read aloud this time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
S3, if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, extracting the voiceprint feature vector of the voice the user read aloud this time, acquiring the standard voiceprint feature vector stored when the user successfully registered, and calculating the distance between the voiceprint feature vector of the voice the user read aloud this time and the standard voiceprint feature vector so as to verify the user's identity;
before the step S1, the method further includes:
S01, when a user performs voiceprint registration in an Interactive Voice Response (IVR) scenario, broadcasting a random code with a second preset number of digits for the user to read aloud a preset number of times, and establishing an acoustic model of the preset type for the broadcast random code and for the voice the user reads aloud each time, respectively;
S02, performing a forced integral alignment operation on the acoustic model of the random code broadcast each time and the acoustic model of the voice the user read aloud that time, and calculating, with a preset algorithm, the probability that the two aligned acoustic models are the same;
S03, if the probabilities that the two aligned acoustic models are the same are all greater than a preset second threshold, extracting the voiceprint feature vector of the voice the user read aloud each time, and calculating the distance between each pair of voiceprint feature vectors so as to analyze whether the readings all came from the same user;
S04, if so, storing the voiceprint feature vector as the standard voiceprint feature vector of the user.
5. The method of identity verification according to claim 4, wherein the acoustic model of the preset type is a deep neural network-hidden Markov model.
6. The method of claim 4, wherein the step of extracting the voiceprint feature vector of the voice the user read aloud this time includes:
performing pre-emphasis and windowing on the voice the user read aloud this time, performing a Fourier transform on each windowed frame to obtain the corresponding spectrum, and passing the spectrum through a Mel filter bank to obtain the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and forming the voiceprint feature vector of the voice the user read aloud this time from the Mel-frequency cepstral coefficients.
7. The method of claim 4, wherein the step of calculating the distance between the voiceprint feature vector of the voice the user read aloud and the standard voiceprint feature vector computes the cosine distance

d(A, B) = 1 − (A · B) / (‖A‖ ‖B‖)

where A is the standard voiceprint feature vector and B is the voiceprint feature vector of the voice the user read aloud this time.
8. A computer-readable storage medium, having stored thereon a processing system which, when executed by a processor, carries out the steps of the method of identity verification according to any one of claims 4 to 7.
CN201810311721.2A 2018-04-09 2018-04-09 Electronic device, identity authentication method and storage medium Active CN108694952B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810311721.2A CN108694952B (en) 2018-04-09 2018-04-09 Electronic device, identity authentication method and storage medium
PCT/CN2018/102208 WO2019196305A1 (en) 2018-04-09 2018-08-24 Electronic device, identity verification method, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810311721.2A CN108694952B (en) 2018-04-09 2018-04-09 Electronic device, identity authentication method and storage medium

Publications (2)

Publication Number Publication Date
CN108694952A CN108694952A (en) 2018-10-23
CN108694952B true CN108694952B (en) 2020-04-28

Family

ID=63844884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810311721.2A Active CN108694952B (en) 2018-04-09 2018-04-09 Electronic device, identity authentication method and storage medium

Country Status (2)

Country Link
CN (1) CN108694952B (en)
WO (1) WO2019196305A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448732B (en) * 2018-12-27 2021-06-08 科大讯飞股份有限公司 Digital string voice processing method and device
CN110536029B (en) * 2019-08-15 2021-11-16 咪咕音乐有限公司 Interaction method, network side equipment, terminal equipment, storage medium and system
CN110491393B (en) * 2019-08-30 2022-04-22 科大讯飞股份有限公司 Training method of voiceprint representation model and related device
CN111161746B (en) * 2019-12-31 2022-04-15 思必驰科技股份有限公司 Voiceprint registration method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680497A (en) * 2012-08-31 2014-03-26 百度在线网络技术(北京)有限公司 Voice recognition system and voice recognition method based on video
CN103986725A (en) * 2014-05-29 2014-08-13 中国农业银行股份有限公司 Client side, server side and identity authentication system and method
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680497A (en) * 2012-08-31 2014-03-26 百度在线网络技术(北京)有限公司 Voice recognition system and voice recognition method based on video
CN103986725A (en) * 2014-05-29 2014-08-13 中国农业银行股份有限公司 Client side, server side and identity authentication system and method
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium

Also Published As

Publication number Publication date
WO2019196305A1 (en) 2019-10-17
CN108694952A (en) 2018-10-23

Similar Documents

Publication Publication Date Title
US11068571B2 (en) Electronic device, method and system of identity verification and computer readable storage medium
CN108694952B (en) Electronic device, identity authentication method and storage medium
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
CN108564954B (en) Deep neural network model, electronic device, identity verification method, and storage medium
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
US10645081B2 (en) Method and apparatus for authenticating user
CN107610709B (en) Method and system for training voiceprint recognition model
CN109428719B (en) Identity verification method, device and equipment
JP6096333B2 (en) Method, apparatus and system for verifying payment
CN108650266B (en) Server, voiceprint verification method and storage medium
US20160014120A1 (en) Method, server, client and system for verifying verification codes
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
KR20210050884A (en) Registration method and apparatus for speaker recognition
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
CN108447489B (en) Continuous voiceprint authentication method and system with feedback
AU2018201573B2 (en) Methods and Systems for Determining User Liveness
KR101424962B1 (en) Authentication system and method based by voice
CN113112992B (en) Voice recognition method and device, storage medium and server
CN112071331A (en) Voice file repairing method and device, computer equipment and storage medium
CN113436633B (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
KR20200107707A (en) Registration method and apparatus for speaker recognition
CN115690920B (en) Credible living body detection method for medical identity authentication and related equipment
CN111933150A (en) Text-related speaker identification method based on bidirectional compensation mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant