CN108447491B - Intelligent voice recognition method - Google Patents

Intelligent voice recognition method

Info

Publication number
CN108447491B
CN108447491B CN201810224944.5A CN201810224944A
Authority
CN
China
Prior art keywords
voice
authentication
power
user
pos machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810224944.5A
Other languages
Chinese (zh)
Other versions
CN108447491A (en)
Inventor
李仁超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Cinda Outwit Technology Co ltd
Original Assignee
Chengdu Cinda Outwit Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Cinda Outwit Technology Co ltd filed Critical Chengdu Cinda Outwit Technology Co ltd
Priority to CN201810224944.5A
Publication of CN108447491A
Application granted
Publication of CN108447491B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21: Speech or voice analysis techniques in which the extracted parameters are power information
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00: Payment architectures, schemes or protocols
    • G06Q20/08: Payment architectures
    • G06Q20/20: Point-of-sale [POS] network systems
    • G06Q20/206: Point-of-sale [POS] network systems comprising security or operator identification provisions, e.g. password entry
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32: Cryptographic mechanisms including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3247: Cryptographic mechanisms involving digital signatures
    • H04L9/3249: Digital signatures using RSA or related signature schemes, e.g. Rabin scheme

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Cash Registers Or Receiving Machines (AREA)

Abstract

The invention provides an intelligent voice recognition method, which comprises the following steps: step 1: distinguishing silence from voice using short-time power and ZCR as features, and performing endpoint detection; step 2: dividing the endpoint-detected voice signal into a number of equal-length frames; step 3: obtaining voice signal features from the dynamic changes of audio power; step 4: performing user identity authentication on the intelligent POS machine based on the comparison result of the voice signal features. The method realizes local storage, comparison and computation of the identity authentication data on the intelligent POS terminal; it requires no hardware password device and no upload of data to the payment platform, and therefore offers higher security.

Description

Intelligent voice recognition method
Technical Field
The invention relates to voice recognition, in particular to an intelligent voice recognition method.
Background
At present, the network security of point-of-sale terminals, and of smart POS devices in particular, is drawing wide attention, and the security of information transmitted through smart POS devices is an increasing concern. Current smart POS applications rely on username/password authentication, issue a digital certificate to the user, and strengthen identity security by exploiting the non-exportability of the private key held in a hardware password terminal. However, any hardware password device is an external physical device outside the smart POS machine, which reduces the usability of the scheme and increases the operational complexity for the user. In prior-art fingerprint identification, the identification information must be transmitted, which challenges security; and if the feature library stored on the payment platform is lost, identity authentication becomes impossible.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an intelligent voice recognition method, which comprises the following steps:
step 1: distinguishing silence from voice using short-time power and ZCR as features, and performing endpoint detection;
step 2: dividing the endpoint-detected voice signal into a number of equal-length frames;
step 3: obtaining voice signal features from the dynamic changes of audio power;
step 4: performing user identity authentication on the intelligent POS machine based on the comparison result of the voice signal features.
Preferably, the endpoint detection further comprises:
before detection, determining thresholds for the short-time power and the ZCR, then continuously calculating the short-time power and ZCR, adjusting the thresholds, and judging through state analysis whether the silence segment has ended.
Preferably, in the endpoint detection, the frequency band is divided into 4 sub-bands, and the power ratio SE of each sub-band is calculated as:

SE_i = Σ_{ω=L_i}^{U_i} |X(ω)|² / Σ_{i=1}^{4} Σ_{ω=L_i}^{U_i} |X(ω)|²

wherein U_i and L_i denote the upper and lower limit frequencies of sub-band i, i = 1, 2, 3, 4, and X(ω) is the amplitude of the signal at frequency ω.
If the short-time power and ZCR of a frame are below their thresholds and the SE values of the 4 sub-bands are approximately equal, the frame is judged to be a silence segment.
Preferably, the step 2 further comprises:
dividing the speech signal into R equal-length, non-overlapping frames, denoted f_k = {f_k(n) | n = 1, 2, …, L/R; k = 1, 2, …, R}, where L is the length of the voice signal, R is the total number of frames, and f_k(n) is the nth sample of the kth frame.
Preferably, the step 3 further comprises:
and calculating the dynamic change of the audio power according to the power difference between the adjacent frames and the adjacent sub-bands thereof, wherein the dynamic change comprises the steps of carrying out power difference on the adjacent sub-bands, solving the difference value of the difference power of the adjacent frames, and carrying out threshold judgment.
Preferably, the step 4 further comprises:
in the voice authentication process, the similarity of voice signals is measured by the normalized Hamming distance. For two audio segments θ1 and θ2, let h1 denote the hash index value of the speech signal θ1 and h2 that of θ2. The normalized Hamming distance D between h1 and h2, i.e. the ratio of the number of erroneous bits to the total number N of bits of the hash index value, is calculated as:

D(h1, h2) = (1/N) Σ_{i=1}^{N} |h1(i) - h2(i)|

If the two audio segments θ1 and θ2 are the same, then D(h1, h2) ≈ 0; if they are different, then D(h1, h2) ≈ 0.5. Let τ denote the identification/authentication threshold: if D(h1, h2) < τ, the features of the two audio segments θ1 and θ2 are considered the same.
Compared with the prior art, the invention has the following advantages:
the invention provides an intelligent voice recognition method, which realizes local storage, comparison and operation of the identity authentication data of an intelligent POS machine terminal, does not need to configure hardware password equipment, does not need to upload the data to a payment platform, and has higher safety.
Drawings
Fig. 1 is a flow chart of an intelligent speech recognition method according to an embodiment of the present invention.
Detailed Description
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details.
One aspect of the invention provides an intelligent speech recognition method. Fig. 1 is a flow chart of an intelligent speech recognition method according to an embodiment of the present invention.
The intelligent POS machine is connected to the payment platform through a secure channel. It obtains from the payment platform a voice recognition request that was enabled in advance, and judges, based on the recognition modes it currently supports, whether it supports voice recognition.
If voice recognition is supported, the intelligent POS machine client verifies the user's identity using the recognition result of the user's voice.
If the verification passes, a random number is encrypted with the private key of the RSA key pair generated when identity authentication was enabled, yielding a first encrypted value, which is sent to the payment platform through the intelligent POS machine client so that the payment platform can perform identity authentication based on the first encrypted value and the user public key obtained when identity authentication was enabled.
In the user identity authentication process, the intelligent POS machine downloads, through the payment platform, the authentication request enabled for the current machine; the intelligent POS machine client determines the recognition modes the current machine supports, screens out the authentications available on it according to the enabled authentication request and the supported authentication modes, and displays them to the user for selection and verification.
After the user is verified, the random number is encrypted with the user private key of the RSA key pair that the authentication module of the intelligent POS machine generated in a secure environment when voice recognition was enabled, and the encrypted value is returned to the payment platform. The payment platform verifies the validity of the encrypted value with the user public key stored when voice recognition was enabled.
Once the encrypted value is obtained, whether the identity authentication succeeds is judged by whether the encrypted value is valid: if it is valid, the identity authentication succeeds; if not, it fails.
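The exchange above amounts to a standard challenge-response signature. A minimal sketch follows, assuming PKCS#1 v1.5 padding with SHA-256 and Python's cryptography package; the patent itself only specifies RSA, so these choices are illustrative:

```python
# Minimal sketch of the random-number (challenge) signing flow described above.
# Assumptions: PKCS#1 v1.5 padding with SHA-256; the patent only specifies RSA.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.exceptions import InvalidSignature

# Enablement: the POS authentication module generates the user key pair in a
# secure environment; the public key is uploaded to the payment platform.
user_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
user_public_key = user_private_key.public_key()

# Authentication: the platform issues a random number (the challenge) ...
challenge = os.urandom(32)

# ... the POS terminal signs it with the locally stored private key
# (the "first encrypted value" in the text above) ...
first_encrypted_value = user_private_key.sign(
    challenge, padding.PKCS1v15(), hashes.SHA256()
)

# ... and the platform verifies it with the stored user public key.
try:
    user_public_key.verify(first_encrypted_value, challenge,
                           padding.PKCS1v15(), hashes.SHA256())
    print("identity authentication succeeded")
except InvalidSignature:
    print("identity authentication failed")
```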
Before authentication is enabled, the intelligent POS machine and the payment platform must negotiate the identification mode. The specific enabling process comprises the following steps:
the intelligent POS machine acquires the negotiated identification mode from the payment platform, enumerates the recognition modes it currently supports, and judges whether it supports voice recognition;
if the intelligent POS machine client supports the voice recognition, the intelligent POS machine client carries out user identity verification by using the voice recognition; if the user identity passes the verification, the authentication module generates an RSA key pair in a secure environment, and encrypts a user public key in the RSA key pair by using an authentication module private key in the intelligent POS machine to generate a second encrypted value;
then, the authentication module uploads the second encrypted value and the user public key encrypted by the authentication module private key to the payment platform through the intelligent POS machine client, so that the payment platform uses the authentication module public key to verify whether the second encrypted value is valid.
In this process, the intelligent POS machine client determines the recognition modes supported by the current machine, screens out the available authentications accordingly and displays them to the user. After the user is verified, the authentication module of the intelligent POS machine generates an RSA key pair, and the public key, together with the enabled authentication request, is returned to the authentication management platform for storage.
After voice recognition is started, an RSA key pair is generated in a trusted storage block of the intelligent POS machine, a user public key in the RSA key pair is exported, and the user public key is transmitted to the payment platform through an encryption transmission protocol. When the intelligent POS machine is used next time, after the authentication module completes identity verification, the private key in the RSA key pair stored in the trusted storage block is directly called to encrypt the abstract, and the encrypted value is transmitted to the payment platform to be verified.
The method comprises the steps of receiving a voice recognition request sent by an intelligent POS machine client through an interface of a trusted storage block, creating a corresponding recognition process according to the received identity recognition request, and managing the authentication module and the voice acquisition module to jointly complete the recognition process by executing the recognition process.
Specifically, when the payment platform receives a voice recognition request sent by the intelligent POS client through the interface of the trusted storage block, the payment platform creates a recognition process according to the voice recognition request, and sends a call instruction to the authentication module by executing the recognition process.
Next, after receiving the call instruction sent by the payment platform, the authentication module determines from it a collection instruction for invoking the voice collection module and returns this instruction to the payment platform, which forwards it to the voice collection module.
And then, the voice acquisition module calls a voice input device of the intelligent POS machine to acquire the voice fragment through an interface of the trusted storage block according to an acquisition instruction forwarded by the payment platform, and returns the acquired voice fragment to the authentication module through the payment platform.
The authentication module receives the voice fragments collected by the voice collection module forwarded by the payment platform. If the calling instruction sent by the payment platform carries the identity information to be recognized, the authentication module can create an association relationship between the voice fragment and the identity information to be recognized, and return the voice fragment and the identity information to be recognized to the payment platform as the voice information to be recognized. Or the authentication module extracts the user voice feature template to be recognized corresponding to the voice fragment according to a preset algorithm, then establishes the association relationship between the user voice feature template to be recognized and the identity information to be recognized, and returns the user voice feature template to be recognized and the identity information to be recognized as the voice information to be recognized to the payment platform.
When the calling instruction sent by the payment platform does not carry the identity information to be identified, the authentication module may return the voice fragment directly to the payment platform, or return the extracted to-be-recognized user voice feature template. The payment platform receives the to-be-recognized voice fragment or feature template. When it receives to-be-recognized voice information, it encrypts it according to a security rule pre-agreed with the platform and returns it to the intelligent POS machine client through the interface of the trusted storage block. When it receives only a voice fragment or feature template, it determines the corresponding to-be-recognized identity information from the calling service, thereby determines the to-be-recognized voice information, and returns it, encrypted, to the intelligent POS machine client through the interface of the trusted storage block.
In a preferred embodiment of the present invention, the verification of the user identity by the intelligent POS machine client using the recognition result of the user's voice further includes: verifying the input voice and, after the verification passes, generating a public/private key pair for the user ID (identity) that logs in to the bank card reading program, the private key being securely stored in the trusted storage block of the intelligent POS machine; and encrypting the public key of the user ID, the user ID itself, and the voice feature sequence of the login user ID with the terminal private key built into the trusted storage block of the intelligent POS machine;
the terminal private key is preset in the secure storage area of the device when the intelligent POS machine leaves the factory, and the public/private key pair of each POS machine is unique;
when the voice of the login user ID is encrypted, it is the feature sequence of the voice that is encrypted: the feature sequence is generated when the voice information is stored into the trusted storage block of the intelligent POS machine; its generation rule may follow any suitable audio-database retrieval rule, and the voice fragment corresponding to a feature sequence is unique.
And sending the public key, the user ID and the voice characteristic sequence which are encrypted by a terminal private key as an authentication request to a payment platform, so that the payment platform verifies the public key after receiving the authentication request, and stores the public key, the user ID and the voice characteristic sequence.
The terminal private key is preset in the secure storage area of the device when the trusted storage block of the intelligent POS machine leaves the factory; the corresponding terminal public key may be sent to the payment platform in advance by the intelligent POS terminal or stored directly on the platform, the terminal public key and private key being matched through the device's unique identifier.
Because the information contained in the authentication request is encrypted with the terminal private key of the intelligent POS terminal, the payment platform, after receiving the request, retrieves the terminal public key corresponding to that private key through the encrypted information to complete the verification. After the verification passes, the public key, the user ID and the voice feature sequence in the authentication request are stored, and the payment platform feeds the identification result back to the trusted storage block of the intelligent POS machine.
After the registration is finished, when the registered user ID logs in the bank card reading program again, voice is input for verification operation; and encrypting the user ID and the characteristic sequence of the voice by a private key of the user ID stored in a trusted memory block of the intelligent POS machine.
And sending the authentication request containing the user ID and the voice feature sequence to a payment platform so that the payment platform can verify after receiving the authentication request, and checking whether the voice feature sequence in the authentication request is consistent with the voice feature sequence corresponding to the user ID during registration to obtain an authentication result.
If the authentication does not pass, the trusted storage block of the intelligent POS machine sends a re-authentication request, and the payment platform may add the failed voice feature sequence to the authentication record, so that a voice feature sequence inconsistent with the one registered can still be granted the authority to use the bank card reading program service.
Specifically, for a voice feature sequence inconsistent with the registered one, if the initiated re-authentication request provides an execution verification code authorizing the bank card reading program service, the voice feature sequence in the authentication request is stored in the authentication record and the identity authentication is completed.
Before matching and recognition, the voice must undergo pre-emphasis, filtering, windowing, framing and endpoint detection. Silence and speech are distinguished by short-time power and ZCR: before detection, thresholds are determined for both; the short-time power and ZCR are then computed continuously, the thresholds are adjusted, and state analysis judges whether the silence segment has ended.
In the endpoint detection, the frequency band is divided into 4 sub-bands, and the power ratio SE of each sub-band is calculated as:

SE_i = Σ_{ω=L_i}^{U_i} |X(ω)|² / Σ_{i=1}^{4} Σ_{ω=L_i}^{U_i} |X(ω)|²

where U_i and L_i denote the upper and lower limit frequencies of sub-band i, i = 1, 2, 3, 4, and X(ω) is the amplitude of the signal at frequency ω.
If the short-time power and ZCR of a frame are below their thresholds and the SE values of the 4 sub-bands are approximately equal, the frame is judged to be a silence segment.
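A minimal sketch of these endpoint-detection features, assuming framed float audio, an FFT-based power spectrum and four equal-width sub-bands (the patent does not fix the band boundaries):

```python
import numpy as np

def frame_features(frame: np.ndarray, n_subbands: int = 4):
    """Short-time power, zero-crossing rate, and sub-band power ratios SE_i."""
    power = np.mean(frame ** 2)                         # short-time power
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # |X(w)|^2
    bands = np.array_split(spectrum, n_subbands)        # 4 equal sub-bands
    band_power = np.array([b.sum() for b in bands])
    se = band_power / band_power.sum()                  # power ratio SE_i
    return power, zcr, se

def is_silence(frame, power_thr, zcr_thr, se_spread_thr=0.1):
    """A frame is judged silence if power and ZCR fall below their thresholds
    and the four SE_i are approximately equal (small spread around 1/4)."""
    power, zcr, se = frame_features(frame)
    return power < power_thr and zcr < zcr_thr and np.ptp(se) < se_spread_thr
```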
Preferably, the detection of the voice signal endpoint is realized by combining a neural network and a particle swarm algorithm:
1: setting the hidden-layer nodes of the one-dimensional neural network to contain K×L phase rotation coefficients θ and K phase control factors λ, and the output-layer nodes to contain K×N θ and N λ, where K is the number of hidden nodes, L the number of input nodes and N the number of output-layer nodes; initializing the parameters of the particle swarm and of the one-dimensional neural network;
2: randomly selecting a signal segment containing both a speech section and noise; taking the short-time power, the circular average magnitude difference function and the band variance as inputs of the one-dimensional neural network, and marking the start and end of each signal frame as its outputs, thereby constructing the training samples of the one-dimensional neural network;
3: inputting the training samples into the one-dimensional neural network for training, and optimizing the network through the particle swarm so that its output meets the design requirement relative to the ideal output, thereby completing the training; the specific optimization steps of the network parameters are:
1) initializing parameters to be optimized and learned; designing the motion position and the velocity vector of the particles for optimization into a matrix, wherein a row represents each parameter to be learned, and a column represents the motion particles for optimization;
2) to evaluate the output |Y⟩_n of the whole one-dimensional neural network, a fitness function is defined:

Fitness = Σ_{n=1}^{N} ( |O⟩_n - |Y⟩_n )²

where |O⟩_n denotes the target output of the nth output neuron and |Y⟩_n its actual output;
3) updating the current speed and position of each particle through the particle-swarm speed and position formulas. The velocity of particle i is updated as:

v_i^(t+1) = v_i^t + c1·r1·(p_i - x_i^t) + c2·r2·(p_g - x_i^t)

and its position as:

x_i^(t+1) = x_i^t + v_i^(t+1)

where r1 and r2 are independent random numbers in [0, 1], and c1 and c2 are acceleration factors: c1 adjusts the step size towards the particle's own best position p_i, and c2 the step size towards the global best position p_g.
4) calculating and evaluating the fitness of each particle so as to update the individual and global extrema;
5) when the termination condition is met, the optimal values of the hidden-layer and output-layer parameters θ and λ of the one-dimensional neural network are obtained; the parameters are then stored and the optimization ends; otherwise, return to 3) and continue searching;
after the neural network training is finished, calculating an original training sample by using the trained one-dimensional neural network, outputting a detection result, if the output result is greater than a threshold value, considering a current frame as a voice frame, otherwise, judging the current frame as a non-voice frame, then comparing an actual output result with a marked signal voice frame, and if the one-dimensional neural network training effect is not good, retraining the one-dimensional neural network;
carrying out voice endpoint detection; and taking a section of voice signal, extracting the characteristic quantity of the voice signal, detecting the voice signal by adopting a trained one-dimensional neural network, and finally outputting a voice endpoint detection result.
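A minimal sketch of the particle-swarm update used in steps 1) to 5), assuming the simplified inertia-free form given above and a squared-error fitness; array shapes and coefficient values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0):
    """One velocity/position update per the formulas above.
    x, v: (n_particles, n_params) positions and velocities;
    p_best: per-particle best positions; g_best: global best position."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    x_new = x + v_new
    return x_new, v_new

def fitness(y_actual, y_target):
    """Squared error between actual and target network outputs."""
    return np.sum((y_target - y_actual) ** 2)
```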
After endpoint detection, the voice signal is divided into R equal-length, non-overlapping frames, denoted f_k = {f_k(n) | n = 1, 2, …, L/R; k = 1, 2, …, R}, where L is the length of the voice signal, R is the total number of frames, and f_k(n) is the nth sample of the kth frame.
After preprocessing, a short-time Fourier transform is applied to each frame, and the sub-bands are divided according to:

B_i = 10^( lg F_min + i·(lg F_max - lg F_min)/M )

where i = 1, 2, 3, …, M is the sub-band number, M the number of sub-bands, and F_min, F_max the lower and upper limits of the auditory bandwidth; the frequency range of sub-band i is [B_{i-1}, B_i]. The power is then computed on each sub-band, giving M sub-band powers; the power of sub-band n in frame k is denoted e(k)_n below.
The dynamic change of the audio power is calculated from the power differences between adjacent frames and their adjacent sub-bands:

E(k)_n = e(k)_{n+1} - e(k)_n
dE(k)_n = E(k+1)_n - E(k)_n

F(k)_n = 1 if dE(k)_n > 0, and F(k)_n = 0 if dE(k)_n ≤ 0,

where n = 0, 1, 2, …, M-1 is the sub-band number and k is the frame number.
That is, the power difference E(k)_n is first taken over adjacent sub-bands, then the difference dE(k)_n of the differential power over adjacent frames is computed, and a threshold decision yields the feature F(k)_n.
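A minimal sketch of this binary power fingerprint, assuming the per-frame sub-band powers e(k)_n have already been computed on the log-spaced bands above:

```python
import numpy as np

def power_fingerprint(subband_power: np.ndarray) -> np.ndarray:
    """subband_power: (R, M) array, e[k][n] = power of sub-band n in frame k.
    Returns F: (R-1, M-1) binary features per the formulas above."""
    E = np.diff(subband_power, axis=1)   # E(k)_n = e(k)_{n+1} - e(k)_n
    dE = np.diff(E, axis=0)              # dE(k)_n = E(k+1)_n - E(k)_n
    return (dE > 0).astype(np.uint8)     # F(k)_n = 1 iff dE(k)_n > 0
```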
The frequency range [0, f_s/2] is divided into N sub-bands, and the centre of gravity of the mth sub-band is calculated:

C_m = Σ_{f=l_m}^{h_m} f·P(f) / Σ_{f=l_m}^{h_m} P(f)

where l_m and h_m are the lower and upper limit frequencies of the sub-band and P(f) is the band power at f.
The sub-band centre of gravity is then normalized so that its value is not affected by the choice of sub-band:

NC_m = ( C_m - (h_m + l_m)/2 ) / (h_m - l_m)

where NC_m is the normalized sub-band centre of gravity.
A parameterized hash index table is used to map the original entries to the hash index table; given a fingerprint F(k)_n, the hash index value is obtained as:

H(F(k)_n) = F(k)_n mod Maxlen

where Maxlen is the size of the hash index table and H(F(k)_n) is the hash index value, taking values in 0 to Maxlen - 1.
computing a kth frame speech signal fk(n) short-time ZCR calculation yields the power ratio per frame:
Ck=Bk/(Rk+b),
wherein b is an anti-overflow constant, RkA short-time ZCR for the kth frame;
vector H ═ H (f (k) for power ration)Ck|k=1,2,…,R}。
The hash sequence H is then scrambled for encryption. First a pseudo-random sequence S = [s_1, s_2, …, s_R] of the same length as the hash sequence is generated; the hash sequence is then rearranged according to the values of the pseudo-random sequence, the encrypted sequence being h(s_i) = h(i),
where h(i) = 1 only when H(i) > H(i-1), and h(i) = 0 otherwise.
In the voice authentication process, the similarity of voice signals is measured by the normalized Hamming distance. For two audio segments θ1 and θ2, let h1 denote the hash index value of the speech signal θ1 and h2 that of θ2. The normalized Hamming distance D between h1 and h2, i.e. the ratio of the number of erroneous bits to the total number N of bits of the hash index value, is calculated as:

D(h1, h2) = (1/N) Σ_{i=1}^{N} |h1(i) - h2(i)|

If the two audio segments θ1 and θ2 are the same, then D(h1, h2) ≈ 0; if they are different, then D(h1, h2) ≈ 0.5. Let τ denote the identification/authentication threshold: if D(h1, h2) < τ, the features of the two audio segments θ1 and θ2 are considered the same and the authentication passes; otherwise the authentication fails.
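A minimal sketch of the normalized-Hamming-distance decision; the threshold value τ is an illustrative assumption (the patent leaves it as a design parameter):

```python
import numpy as np

def normalized_hamming(h1: np.ndarray, h2: np.ndarray) -> float:
    """D(h1, h2) = (1/N) * sum_i |h1(i) - h2(i)| over binary hash bits."""
    assert h1.shape == h2.shape
    return float(np.mean(h1 != h2))

def authenticate(h1, h2, tau: float = 0.15) -> bool:
    """Pass authentication iff the normalized Hamming distance is below tau.
    Matching segments give D ~ 0; unrelated segments give D ~ 0.5."""
    return normalized_hamming(h1, h2) < tau
```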
In another preferred embodiment, the unregistered user may also register with the payment platform via a random voice string. Specifically, the payment platform generates a random character string and sends the random character string to the intelligent POS machine user; the user records the received random character string into voice and sends the voice to the payment platform; after the payment platform receives the voice of the user, extracting MFCC characteristics of the voice;
converting the voice into a character string text according to the MFCC characteristics of the voice, and if the obtained character string text is the same as the content of a pre-generated random character string, marking the section of voice as valid registration voice; otherwise, marking as invalid voice;
accordingly, in the verification phase: when an intelligent POS machine user sends an identity authentication request, a payment platform firstly generates a random character string and sends the random character string to the user, the user records the received random character string according to the sequence specified by the payment platform to obtain authentication voice, and the generated authentication voice is sent to the payment platform; if the user fails to input the voice within a certain duration, the current random character string is invalid, and the user authentication fails;
after receiving the authentication voice, the payment platform extracts its MFCC features; it verifies whether the user characteristics of the authentication voice belong to the current user and whether the content matches the correct character-string text, obtaining a voice matching value S1 and a text matching value S2, respectively;
The voice matching value S1 and the text matching value S2 are weighted and summed to obtain a final score, which is compared against a set threshold: if the final score exceeds the threshold, the authentication voice is considered to come from a registered user of the intelligent POS machine and its text content to be correct, and the verification passes; otherwise the verification fails.
The final score is calculated as:

S = w·S1 + (1 - w)·S2

where S is the final score and w is the weight, 0 < w < 1.
Wherein, verifying whether the user characteristics of the authentication voice belong to the current user and whether the content matches the correct character-string text further comprises:
constructing a first HMM in the order of the correct string text;
according to the MFCC features of the authentication voice and the first HMM, obtaining the mapping between the MFCC features and the first HMM states with the Viterbi algorithm, such that:

Φ*_t = argmax_Φ p(X_t | H, Φ_t)

where X_t = {x_t(1), x_t(2), …, x_t(N_t)} is the MFCC feature set of the authentication voice, N_t is the total number of authentication voice features, and the subscript t denotes the authentication voice segment; H is the first HMM, Φ_t is the mapping of the voice MFCC features to HMM states, p(X_t | H, Φ_t) is the overall likelihood of the feature set X_t under the first HMM with state correspondence Φ_t, and Φ*_t is the optimal mapping between the MFCC features of the authentication voice and the first HMM states found by the Viterbi algorithm;
according to the mapping between the MFCC features of the authentication voice and the first HMM states, the mapping between the MFCC features and each character is further obtained, and the log-likelihood ratio of the authentication voice between the specific user's voice GMM and the universal GMM is calculated as the voice matching value S1:

S1 = (1/N1) Σ_n [ log p(x_t(n) | Λ0_d(n)) - log p(x_t(n) | Λ_d(n)) ]

where x_t(n) is the nth-frame MFCC feature of the authentication voice, N1 is the number of MFCC features corresponding to all character texts in the authentication voice, d(n) is the character corresponding to the nth-frame MFCC feature under the correct character-string text, Λ0_d(n) and Λ_d(n) are the specific-user GMM and the universal GMM for character d(n), and p(x_t(n) | Λ0_d(n)) and p(x_t(n) | Λ_d(n)) are the likelihoods of x_t(n) under the two GMMs;
recognizing the character-string content of the authentication voice and taking the recognized string as the optimal string; constructing a second HMM from the optimal string using the universal GMM;
obtaining the mapping between the MFCC features of the authentication voice and the second HMM states with the Viterbi algorithm, and from it the mapping between the MFCC features and each character;
according to the obtained mappings of the MFCC features to characters under the correct character-string text and under the optimal string, calculating the log-likelihood ratio of the authentication voice between the specific-user voice GMM and the universal GMM as the text matching value S2:

S2 = (1/N1) Σ_n log p(x_t(n) | Λ0_d(n)) - (1/N2) Σ_n log p(x_t(n) | Λ_d2(n))

where N2 is the number of MFCC features corresponding to the optimal character text in the authentication voice, d2(n) is the character corresponding to the nth-frame MFCC feature under the optimal string, Λ_d2(n) is the universal GMM for d2(n), and p(x_t(n) | Λ_d2(n)) is the likelihood of x_t(n) under that universal GMM.
To eliminate the effect of channel mismatch, when estimating the user identification model, modeling is performed simultaneously in the user identification space and the channel space based on factor analysis: a piece of speech is represented by a composite vector, i.e. the speech space consists of the composite vectors of the user and the channel.
The composite vector M is expressed as:

M = s + c
s = m + Vy + Dz
c = Ux

where s is the user feature space vector, c the channel space vector, m the universal GMM supervector, and V, D and U the space matrices. The components of the vector x act as channel factors, those of y as user identification factors, and those of z are called residual factors. Factor analysis proceeds by estimating the space matrices, building the user identification model, and testing.
In the space-matrix estimation, given a speech utterance of user s with feature vectors x_1, x_2, …, x_T, the following statistics are computed:

N_c(s) = Σ_t γ_t(c)
F_c(s) = Σ_t γ_t(c)·(x_t - m_c)
S_c(s) = diag( Σ_t γ_t(c)·(x_t - m_c)(x_t - m_c)ᵀ )

where m_c is the mean sub-vector of mixture c, γ_t(c) is the occupation probability of mixture c for frame x_t, and N_c(s), F_c(s), S_c(s) are the zero-, first- and second-order statistics of user s on the cth Gaussian.
The statistics are then assembled: the N_c(s) are placed on the diagonal of a CF×CF matrix N(s); the F_c(s) are concatenated into a CF×1 column vector F(s); and the S_c(s) form a CF×CF diagonal matrix S(s), where CF is the dimension of the universal GMM supervector.
Then the intermediate variable for each user is calculated:

L(s) = I + Vᵀ Ψ⁻¹ N(s) V

where Ψ is the covariance matrix of the universal GMM and I is the identity matrix.
Using L(s), the first- and second-order expectations of the user identification factor y(s) are calculated:

E[y(s)] = L⁻¹(s) Vᵀ Ψ⁻¹ F(s)
E[y(s) yᵀ(s)] = E[y(s)] E[yᵀ(s)] + L⁻¹(s)

where N(s), F(s) and S(s) are the zero-, first- and second-order statistics of the feature space vector of user s.
Finally the user identification space matrix V and the covariance matrix Ψ are updated:

V = ( Σ_s F(s) E[yᵀ(s)] ) ( Σ_s N(s) E[y(s) yᵀ(s)] )⁻¹
Ψ_new = ( Σ_s N(s) )⁻¹ { Σ_s S(s) - diag( Σ_s F(s) E[yᵀ(s)] Vᵀ ) }
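A minimal numpy sketch of the posterior computation for the user factor y(s) from these formulas, assuming a diagonal covariance Ψ and a single utterance; the dimensions and initialization are illustrative:

```python
import numpy as np

def jfa_estep(V, psi_diag, N_s, F_s):
    """Posterior of the user factor y(s):
    L(s) = I + V^T Psi^-1 N(s) V,  E[y] = L^-1 V^T Psi^-1 F(s),
    E[y y^T] = E[y] E[y]^T + L^-1.
    V: (CF, R) user space matrix; psi_diag: (CF,) diagonal of Psi;
    N_s: (CF,) diagonal of N(s); F_s: (CF,) first-order statistics."""
    CF, R = V.shape
    inv_psi = 1.0 / psi_diag
    L = np.eye(R) + V.T @ (inv_psi[:, None] * (N_s[:, None] * V))
    L_inv = np.linalg.inv(L)
    Ey = L_inv @ V.T @ (inv_psi * F_s)      # E[y(s)]
    Eyy = np.outer(Ey, Ey) + L_inv          # E[y(s) y(s)^T]
    return Ey, Eyy
```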
In summary, the invention provides an intelligent voice recognition method which realizes local storage, comparison and computation of the intelligent POS terminal identity authentication data; it requires no hardware password device and no upload to the payment platform, and is therefore more secure.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented in a general purpose computing system, centralized on a single computing system, or distributed across a network of computing systems, and optionally implemented in program code that is executable by the computing system, such that the program code is stored in a storage system and executed by the computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (3)

1. An intelligent speech recognition method, comprising:
step 1: distinguishing silence from voice using short-time power and ZCR as features, and performing endpoint detection;
step 2: dividing the endpoint-detected voice signal into a number of equal-length frames;
step 3: obtaining voice signal features from the dynamic changes of audio power;
step 4: performing user identity authentication on the intelligent POS machine based on the comparison result of the voice signal features;
the endpoint detection further comprising:
before detection, determining thresholds for the short-time power and the ZCR, continuously calculating the short-time power and ZCR, adjusting the thresholds, and judging through state analysis whether the silence segment has ended;
wherein in the endpoint detection, the frequency band is divided into 4 sub-bands, and the power ratio SE of each sub-band is calculated as:

SE_i = Σ_{ω=L_i}^{U_i} |X(ω)|² / Σ_{i=1}^{4} Σ_{ω=L_i}^{U_i} |X(ω)|²

wherein U_i and L_i denote the upper and lower limit frequencies of sub-band i, i = 1, 2, 3, 4, and X(ω) is the amplitude of the signal at frequency ω;
and if the short-time power and ZCR of a frame are below their thresholds and the SE values of the 4 sub-bands are approximately equal, the frame is judged to be a silence segment.
2. The method of claim 1, wherein the step 2 further comprises:
dividing the speech signal into R equal-length, non-overlapping frames, denoted f_k = {f_k(n) | n = 1, 2, …, L/R; k = 1, 2, …, R}, where L is the length of the voice signal, R is the total number of frames, and f_k(n) is the nth sample of the kth frame.
3. The method of claim 1, wherein the step 3 further comprises:
and calculating the dynamic change of the audio power according to the power difference between the adjacent frames and the adjacent sub-bands thereof, wherein the dynamic change comprises the steps of carrying out power difference on the adjacent sub-bands, solving the difference value of the difference power of the adjacent frames, and carrying out threshold judgment.
CN201810224944.5A 2018-03-19 2018-03-19 Intelligent voice recognition method Active CN108447491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810224944.5A CN108447491B (en) 2018-03-19 2018-03-19 Intelligent voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810224944.5A CN108447491B (en) 2018-03-19 2018-03-19 Intelligent voice recognition method

Publications (2)

Publication Number Publication Date
CN108447491A CN108447491A (en) 2018-08-24
CN108447491B true CN108447491B (en) 2021-08-10

Family

ID=63195147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810224944.5A Active CN108447491B (en) 2018-03-19 2018-03-19 Intelligent voice recognition method

Country Status (1)

Country Link
CN (1) CN108447491B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065076B (en) * 2018-09-05 2020-11-27 深圳追一科技有限公司 Audio label setting method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000347686A (en) * 1999-06-03 2000-12-15 Toshiba Tec Corp Voice processing device, service quality improvement assisting device using it, and goods sales control device
CN102023604A (en) * 2010-11-24 2011-04-20 陕西电力科学研究院 Intelligent online monitoring system capable of preventing external damage on transmission line
CN107104803B (en) * 2017-03-31 2020-01-07 北京华控智加科技有限公司 User identity authentication method based on digital password and voiceprint joint confirmation

Also Published As

Publication number Publication date
CN108447491A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
US11545155B2 (en) System and method for speaker recognition on mobile devices
US10825452B2 (en) Method and apparatus for processing voice data
US20180047397A1 (en) Voice print identification portal
KR102601279B1 (en) Remote usage of locally stored biometric authentication data
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
Monrose et al. Using voice to generate cryptographic keys
US8384516B2 (en) System and method for radio frequency identifier voice signature
WO2016015687A1 (en) Voiceprint verification method and device
US9106422B2 (en) System and method for personalized security signature
JP2016511475A (en) Method and system for distinguishing humans from machines
US20060229879A1 (en) Voiceprint identification system for e-commerce
CN111897909B (en) Ciphertext voice retrieval method and system based on deep perceptual hashing
CN108550368B (en) Voice data processing method
CN108416592B (en) High-speed voice recognition method
Nagakrishnan et al. A robust cryptosystem to enhance the security in speech based person authentication
US11611881B2 (en) Integrated systems and methods for passive authentication
CN111710340A (en) Method, device, server and storage medium for identifying user identity based on voice
CN108447491B (en) Intelligent voice recognition method
EP3373177B1 (en) Methods and systems for determining user liveness
Nagakrishnan et al. Novel secured speech communication for person authentication
Aloufi et al. On-Device Voice Authentication with Paralinguistic Privacy
ABDUL-HASSAN et al. CENTRAL INTELLIGENT BIOMETRIC AUTHENTICATION BASED ON VOICE RECOGNITION AND FUZZY LOGIC
CN114023334A (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
KR100786665B1 (en) Voiceprint identification system for e-commerce
Petry et al. Speaker recognition techniques for remote authentication of users in computer networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant