WO2017162053A1 - Identity authentication method and device - Google Patents


Info

Publication number
WO2017162053A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
segmentation
score
target
hmm
Application number
PCT/CN2017/076336
Other languages
French (fr)
Chinese (zh)
Inventor
朱长宝
李欢欢
袁浩
王金明
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2017162053A1


Classifications

    • G10L 15/02 — Speech recognition; Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 — Speech recognition; Segmentation; Word boundary detection
    • G10L 15/142 — Speech classification or search using statistical models; Hidden Markov Models [HMMs]
    • G10L 17/00 — Speaker identification or verification
    • G10L 17/24 — Interactive procedures; Man-machine interfaces; the user being prompted to utter a password or a predefined phrase
    • H04L 9/3226 — Entity authentication using a predetermined code, e.g. password, passphrase or PIN
    • H04L 9/3231 — Entity authentication using biological data, e.g. fingerprint, voice or retina

Definitions

  • adjacent pairs of initial segmentation points are selected in turn as the start and end of an interval; within the interval, the average energy is computed in units of a specified number of frames, a point at which the average energy increases a specified number of consecutive times is located, and the point where the increase begins is taken as a new initial segmentation point (otherwise, the initial segmentation point is not updated); the initial segmentation units are the segments delimited by the initial segmentation points.
  • performing forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts.
  • the voiceprint matching module is configured to match the target voice with the target voiceprint model to obtain a first voiceprint score, and match the non-target voice with the target voiceprint model to obtain a second voiceprint score;
  • the processing module is further configured to: select each target text model in turn and match voice features of non-target texts against the corresponding target text model to obtain impostor text scores, and compute the mean and standard deviation of the impostor text scores for that target text model; subtract the corresponding impostor-score mean from the first text scores and the second text scores and divide by the standard deviation, obtaining regularized text scores; combine the regularized first text scores with the first voiceprint scores and obtain the maximum and minimum values for each target text; and use these maxima and minima to normalize the regularized scores.
  • the text matching module matches the voice features of each segmentation unit against all the target text models to obtain a unit text matching score for each segmentation unit and each target text model: the voice features of each segmentation unit are used as input to each target text hidden Markov model (HMM), and the output probability obtained with the Viterbi algorithm is taken as the corresponding unit text matching score.
  • step d: combine the voiceprint scores and the regularized text scores, obtain the maximum and minimum values for each target text, and use these maxima and minima to normalize the voiceprint and text scores.
  • Target speaker: a person trusted by the system, who needs to pass voiceprint authentication.
  • Step 110: a decision is made by applying the integrated decision classifier to the input feature vector new_score. For each input the output is 1 or 0: when the output is 1 the test voice is accepted, and when it is 0 the test voice is rejected.

Abstract

An identity authentication method, comprising: acquiring a voice feature of an input voice and matching the voice feature against a pre-stored target voiceprint model to obtain a voiceprint matching score (11); segmenting the input voice according to the voice feature and a target text model, and acquiring initial segmentation units and the number of initial voice segmentation units (12); if the number of initial voice segmentation units is greater than or equal to a first threshold, performing forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts; matching the voice feature of each segmentation unit against every target text model to obtain a unit text matching score for each segmentation unit and each target text model (13); and performing identity authentication according to the unit text matching scores, the voiceprint matching score, and a pre-trained probabilistic neural network (PNN) classifier (14). The method achieves two-factor authentication of a user and increases system security.

Description

Method and device for identity authentication

Technical field

This document relates to, but is not limited to, the technical field of dynamic biometric security authentication, and in particular to a method and device for identity authentication.

Background art

With the continuous development of Internet information technology, online business and e-commerce are increasingly prosperous, people are ever more closely connected to computer networks, and a variety of network security threats have followed; protecting users' personal information has become an urgent problem. Dynamic voiceprint password recognition combines two authentication technologies, speaker recognition and speech recognition, so it can effectively prevent recording attacks and greatly enhance system security. Typically, after receiving a user's voice containing the password, the system first computes separate scores for the voiceprint and the dynamic password, then either compares each score with its own threshold, or fuses the two scores and compares the result with a combined threshold; if the score exceeds the preset threshold the requester is admitted to the protected system, otherwise entry is refused. In practice, however, owing to environmental influences, the distributions of speaker voiceprint matching scores and text matching scores often differ, and judging only against preset thresholds loses accuracy.
Summary of the invention

The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
An embodiment of the present invention provides an identity authentication method, including:

acquiring a voice feature of an input voice, and matching the voice feature against a pre-stored target voiceprint model to obtain a voiceprint matching score;

segmenting the input voice according to the voice feature and a preset target text model, and acquiring initial segmentation units and the number of initial voice segmentation units; if the number of initial voice segmentation units is smaller than a first threshold, determining that the input voice is illegal; if the number of initial voice segmentation units is greater than or equal to the first threshold, performing forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts;

matching the voice feature of each segmentation unit against all the target text models to obtain a unit text matching score for each segmentation unit and each target text model;

performing identity authentication according to the unit text matching scores, the voiceprint matching score, and a pre-trained probabilistic neural network (PNN) classifier.
Optionally, the PNN classifier is trained in the following manner:

matching target voices against the target text models and the target voiceprint model to obtain first text scores and first voiceprint scores respectively, and combining the first text scores and first voiceprint scores into the acceptance feature information of the decision classifier;

matching non-target voices against the target text models and the target voiceprint model to obtain second text scores and second voiceprint scores respectively, and combining the second text scores and second voiceprint scores into the rejection feature information of the decision classifier;

training the PNN classifier with the acceptance feature information and the rejection feature information.
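The training step above can be sketched as a minimal probabilistic neural network, i.e. a Parzen-window classifier with one Gaussian kernel per training sample. The two-class layout, feature dimensionality, and smoothing parameter `sigma` below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

class PNN:
    """Minimal probabilistic neural network (Parzen-window classifier).

    Pattern layer: one Gaussian kernel per training sample.
    Summation layer: per-class mean of the kernel activations.
    Output layer: argmax over the class densities (1 = accept, 0 = reject).
    """

    def __init__(self, sigma=0.1):
        self.sigma = sigma  # kernel width (illustrative default)

    def fit(self, accept_feats, reject_feats):
        # Feature vectors are e.g. [regularized text score, voiceprint score].
        self.classes = [np.asarray(reject_feats, float),   # label 0
                        np.asarray(accept_feats, float)]   # label 1
        return self

    def predict(self, x):
        x = np.asarray(x, float)
        densities = []
        for samples in self.classes:
            d2 = np.sum((samples - x) ** 2, axis=1)
            densities.append(np.mean(np.exp(-d2 / (2.0 * self.sigma ** 2))))
        return int(np.argmax(densities))
```

Training it with acceptance features from target speech and rejection features from impostor speech mirrors the procedure above; no weight optimization is involved, which is why the patent can train the classifier directly from the score vectors.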
Optionally, before the PNN classifier is trained with the acceptance feature information and the rejection feature information, the method further includes performing score regularization on the voiceprint scores and text scores of the target voices and the non-target voices, including:

selecting each target text model in turn and matching voice features of non-target texts against the corresponding target text model to obtain impostor text scores, and computing the mean and standard deviation of the impostor text scores for that target text model;

subtracting the corresponding impostor-score mean from the first text scores and the second text scores and dividing by the standard deviation, to obtain the regularized text scores;

combining the regularized first text scores with the first voiceprint scores, obtaining the maximum and minimum values for each target text, and normalizing the regularized first text scores and first voiceprint scores with these maxima and minima as the acceptance feature information of the PNN classifier;

combining the regularized second text scores with the second voiceprint scores, obtaining the maximum and minimum values for each target text, and normalizing the regularized second text scores and second voiceprint scores with these maxima and minima as the rejection feature information of the PNN classifier.
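The regularization steps above amount to a z-norm against impostor statistics followed by min-max scaling. A minimal sketch, assuming scores are plain floating-point arrays (the function names are ours):

```python
import numpy as np

def znorm_text_scores(text_scores, impostor_mean, impostor_std):
    """Regularize raw text scores with per-target-text impostor statistics:
    subtract the impostor-score mean and divide by its standard deviation."""
    return (np.asarray(text_scores, float) - impostor_mean) / impostor_std

def minmax_normalize(scores, lo=None, hi=None):
    """Map scores into [0, 1] using the minimum/maximum obtained per
    target text (here defaulted to the observed extremes)."""
    scores = np.asarray(scores, float)
    lo = scores.min() if lo is None else lo
    hi = scores.max() if hi is None else hi
    return (scores - lo) / (hi - lo)
```

The z-norm makes text scores comparable across target texts despite differing score distributions, and the min-max step puts text and voiceprint scores on a common [0, 1] scale before they are fed to the classifier.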
Optionally, segmenting the input voice according to the voice feature and the preset target text model to obtain the initial segmentation units includes:

combining the corresponding target text hidden Markov models (HMMs) into a first composite HMM according to the target text sequence in the target password;

performing Viterbi decoding with the voice feature as the input of the first composite HMM to obtain a first state output sequence, and taking as initial segmentation points the positions in the first state output sequence whose state index is an integer multiple of the number of states of a single target text HMM;

selecting adjacent pairs of initial segmentation points in turn as the start and end of an interval; within the interval, computing the average energy in units of a specified number of frames, looking for a point at which the average energy increases a specified number of consecutive times, and taking the point where the increase begins as a new initial segmentation point; the initial segmentation units are the segments delimited by the initial segmentation points.
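The energy-based refinement of one initial segmentation point might look like the sketch below. The window size `win` and the required number of consecutive rises `rises` stand in for the "specified" values in the text, and the exact frame returned for "the point where the increase begins" is our assumption:

```python
import numpy as np

def refine_cut_point(frame_energy, start, end, win=5, rises=3):
    """Between two adjacent initial cut points [start, end), average the
    frame energy over windows of `win` frames and look for `rises`
    consecutive increases; the first frame of the window where the rise
    begins becomes the new cut point. If no sustained rise is found, the
    initial cut point is not updated."""
    n_win = (end - start) // win
    if n_win < rises + 1:
        return start
    avg = [np.mean(frame_energy[start + i * win : start + (i + 1) * win])
           for i in range(n_win)]
    run = 0
    for i in range(1, n_win):
        run = run + 1 if avg[i] > avg[i - 1] else 0
        if run == rises:
            return start + (i - rises + 1) * win  # window where the rise began
    return start  # no sustained rise: keep the old cut point
```

Moving the cut point to where energy starts to climb pulls the boundary away from trailing silence onto the onset of the next spoken unit.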
Optionally, combining the corresponding target text HMMs into the first composite HMM includes:

the number of states of the first composite HMM is the sum of the numbers of states of the single target text HMMs, and the Gaussian mixture model parameters of each state of the first composite HMM are identical to those of the corresponding state of the single target text HMM;

the self-transition probability of the last state in the state transition matrix of each single target text HMM is set to 0 and its transition probability to the next state is set to 1, while the state transition probability matrix of the last single target text HMM of the target text is left unchanged;

the state transition probability matrices of the single target text HMMs are merged in the order in which the single target texts appear in the target text, yielding the state transition probability matrix of the composite HMM.
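A sketch of the transition-matrix merge described above, assuming each per-text HMM is represented here only by its transition matrix (the per-state GMM parameters are carried over unchanged and omitted):

```python
import numpy as np

def compose_transition_matrix(trans_mats):
    """Build the composite HMM transition matrix from per-text HMM
    transition matrices, merged block-diagonally in password order.
    For every model except the last, the final state's self-transition
    is set to 0 and its transition into the first state of the next
    model is set to 1; the last model is left unchanged."""
    sizes = [m.shape[0] for m in trans_mats]
    A = np.zeros((sum(sizes), sum(sizes)))
    off = 0
    for idx, m in enumerate(trans_mats):
        n = m.shape[0]
        A[off:off + n, off:off + n] = m
        if idx < len(trans_mats) - 1:       # not the last target text
            A[off + n - 1, off + n - 1] = 0.0  # drop the final self-loop
            A[off + n - 1, off + n] = 1.0      # bridge to the next model
        off += n
    return A
```

Forcing the bridge transition to probability 1 makes a Viterbi pass through the composite model visit every target text in order, which is what lets the state sequence be cut at multiples of the single-model state count.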
Optionally, performing forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts includes:

selecting the initial segmentation unit with the longest feature segment for forced segmentation, so that the total number of segmentation units after forced segmentation equals the preset number of target texts.
Optionally, performing forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts includes:

starting forced splitting in order of initial segmentation unit length from longest to shortest, each time splitting one initial segmentation unit evenly into two segments, until the total number of segmentation units after splitting equals the number of target texts;

if the number of forced segmentations is greater than or equal to a second threshold, the forced segmentation ends; if the number of forced segmentations is smaller than the second threshold, each current segmentation unit is matched and scored against each target text hidden Markov model (HMM), the target text HMM with the highest score is selected for each unit, and the selected target text HMMs are combined into a second composite HMM; Viterbi decoding is performed with the voice feature as the input of the second composite HMM to obtain a second state output sequence, and the positions in the second state output sequence whose state index is an integer multiple of the number of states of a single target text HMM are taken as segmentation points; the segments into which these points divide the voice feature are the segmentation units; if the current number of segmentation units is smaller than a third threshold, the current segmentation units are taken as the initial segmentation units and forced segmentation continues; if the current number of segmentation units is greater than or equal to the third threshold, the forced segmentation ends.
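The length-ordered halving in the first part of this procedure can be sketched as follows; representing units as (start, end) frame intervals is our assumption, and the threshold-controlled re-decoding loop is omitted:

```python
def force_split(units, target_count):
    """Split the longest unit in half, repeatedly, until the number of
    units equals the number of target texts. Each unit is a
    (start, end) frame interval."""
    units = sorted(units)
    while len(units) < target_count:
        # Pick the currently longest unit (longest-first order).
        i = max(range(len(units)), key=lambda k: units[k][1] - units[k][0])
        s, e = units[i]
        mid = (s + e) // 2
        units[i:i + 1] = [(s, mid), (mid, e)]  # replace it by two halves
        units.sort()
    return units
```

Because the longest unit is the one most likely to contain two merged password digits, halving it first is a cheap heuristic before the more expensive re-decoding with the second composite HMM.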
Optionally, matching the voice feature of each segmentation unit against all the target text models to obtain a unit text matching score for each segmentation unit and each target text model includes:

using the voice feature of each segmentation unit as the input of each target text hidden Markov model (HMM), and taking the output probability obtained with the Viterbi algorithm as the corresponding unit text matching score.
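Scoring one segmentation unit against one target HMM is a standard Viterbi best-path computation; a sketch in the log domain, assuming log-space emission, transition, and initial probabilities are already available:

```python
import numpy as np

def viterbi_score(log_emis, log_trans, log_init):
    """Text matching score of one segmentation unit against one target
    HMM: the Viterbi (best-path) log output probability.

    log_emis:  (T, N) log emission probability of each frame per state
    log_trans: (N, N) log state-transition probabilities
    log_init:  (N,)   log initial-state probabilities
    """
    T, N = log_emis.shape
    delta = log_init + log_emis[0]          # best score ending in each state
    for t in range(1, T):
        # Best predecessor per state, then add the frame's emission term.
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emis[t]
    return float(np.max(delta))
```

Computing this score for every (unit, target text) pair yields the score matrix that the m-best candidate-text check in the decision step operates on.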
Optionally, performing identity authentication according to the unit text matching scores, the voiceprint matching score, and the pre-trained decision classifier includes:

taking, for each segmentation unit, the texts corresponding to the m highest of its unit text matching scores as candidate texts; if the candidate texts contain the target text corresponding to the segmentation unit, the segmentation unit passes authentication; counting the total number of segmentation units that pass: if this total is smaller than or equal to a fourth threshold, text authentication fails and identity authentication fails; if the total is greater than the fourth threshold, text authentication of the input voice passes;

determining whether the voiceprint matching score is greater than a fifth threshold; if so, voiceprint authentication passes and identity authentication passes; if not, performing score regularization on the text scores of each segmentation unit against its corresponding target text model and on the voiceprint matching score, and feeding the regularized scores to the decision classifier for identity authentication.
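The decision cascade above (per-unit m-best text check, a voiceprint threshold, then the classifier as a fallback) can be sketched as follows; the threshold values and the score-dictionary representation are illustrative assumptions, and `pnn_fallback` stands in for score regularization plus the trained classifier:

```python
def authenticate(unit_scores, target_texts, voiceprint_score,
                 m=3, pass_threshold=4, vp_threshold=0.8,
                 pnn_fallback=None):
    """Decision cascade for one input voice.

    unit_scores:  list of {candidate_text: score} dicts, one per unit
    target_texts: expected target text per unit, same order
    """
    passed = 0
    for scores, target in zip(unit_scores, target_texts):
        m_best = sorted(scores, key=scores.get, reverse=True)[:m]
        if target in m_best:                # unit passes text authentication
            passed += 1
    if passed <= pass_threshold:            # text authentication failed
        return False
    if voiceprint_score > vp_threshold:     # voiceprint passes outright
        return True
    # Borderline voiceprint: defer to the trained decision classifier.
    if pnn_fallback is not None:
        return bool(pnn_fallback(unit_scores, voiceprint_score))
    return False
```

Only the borderline cases reach the classifier, which is what lets the PNN absorb the environment-dependent score-distribution shifts that fixed thresholds handle poorly.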
An embodiment of the present invention further provides an identity authentication device, comprising a probabilistic neural network (PNN) classifier and including:

a voiceprint matching module, configured to acquire a voice feature of an input voice and match the voice feature against a pre-stored target voiceprint model to obtain a voiceprint matching score;

a segmentation module, configured to segment the input voice according to the voice feature and a preset target text model, acquire initial segmentation units and the number of initial voice segmentation units, determine that the input voice is illegal if the number of initial voice segmentation units is smaller than a first threshold, and perform forced segmentation on the initial segmentation units if the number of initial voice segmentation units is greater than or equal to the first threshold, so that the total number of segmentation units equals the preset number of target texts;

a text matching module, configured to match the voice feature of each segmentation unit against all the target text models to obtain a unit text matching score for each segmentation unit and each target text model;

an authentication module, configured to perform identity authentication according to the unit text matching scores, the voiceprint matching score, and the pre-trained PNN classifier.
Optionally, the device further includes a processing module;

the voiceprint matching module is configured to match target voices against the target voiceprint model to obtain first voiceprint scores, and match non-target voices against the target voiceprint model to obtain second voiceprint scores;

the text matching module is configured to match the target voices against the target text models to obtain first text scores, and match the non-target voices against the target text models to obtain second text scores;

the processing module is configured to combine the first text scores and first voiceprint scores into the acceptance feature information of the PNN classifier, and combine the second text scores and second voiceprint scores into the rejection feature information of the PNN classifier;

the PNN classifier is trained with the acceptance feature information and the rejection feature information.
Optionally, the processing module is further configured to: select each target text model in turn, match voice features of non-target texts against the corresponding target text model to obtain impostor text scores, and compute the mean and standard deviation of the impostor text scores for that target text model; subtract the corresponding impostor-score mean from the first text scores and the second text scores and divide by the standard deviation, to obtain the regularized text scores; combine the regularized first text scores with the first voiceprint scores, obtain the maximum and minimum values for each target text, and normalize the regularized first text scores and first voiceprint scores with these maxima and minima as the acceptance feature information of the PNN classifier; and combine the regularized second text scores with the second voiceprint scores, obtain the maximum and minimum values for each target text, and normalize the regularized second text scores and second voiceprint scores with these maxima and minima as the rejection feature information of the PNN classifier.
Optionally, the segmentation module segments the input voice according to the voice feature and the preset target text model and obtains the initial segmentation units by: combining the corresponding target text hidden Markov models (HMMs) into a first composite HMM according to the target text sequence in the target password; performing Viterbi decoding with the voice feature as the input of the first composite HMM to obtain a first state output sequence, and taking as initial segmentation points the positions in the first state output sequence whose state index is an integer multiple of the number of states of a single target text HMM; and selecting adjacent pairs of initial segmentation points in turn as the start and end of an interval, computing within the interval the average energy in units of a specified number of frames, looking for a point at which the average energy increases a specified number of consecutive times, and taking the point where the increase begins as a new initial segmentation point, the initial segmentation units being the segments delimited by the initial segmentation points.
Optionally, the segmentation module combines the corresponding target text HMMs into the first composite HMM by: making the number of states of the first composite HMM equal to the sum of the numbers of states of the single target text HMMs, with the Gaussian mixture model parameters of each state of the first composite HMM identical to those of the corresponding state of the single target text HMM; setting the self-transition probability of the last state in the state transition matrix of each single target text HMM to 0 and its transition probability to the next state to 1, leaving the state transition probability matrix of the last single target text HMM of the target text unchanged; and merging the state transition probability matrices of the single target text HMMs in the order in which the single target texts appear in the target text, yielding the state transition probability matrix of the composite HMM.
Optionally, the segmentation module performs forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts by: selecting the initial segmentation unit with the longest feature segment for forced segmentation, so that the total number of segmentation units after forced segmentation equals the preset number of target texts.
Optionally, the segmentation module performs forced segmentation on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts by: starting splitting in order of initial segmentation unit length from longest to shortest, each time splitting one initial segmentation unit evenly into two segments, until the total number of units after splitting equals the number of target texts; if the number of forced segmentations is greater than or equal to a second threshold, the forced segmentation ends; if the number of forced segmentations is smaller than the second threshold, each current segmentation unit is matched and scored against each target text hidden Markov model (HMM), the target text HMM with the highest score is selected for each unit, and the selected target text HMMs are combined into a second composite HMM; Viterbi decoding is performed with the voice feature as the input of the second composite HMM to obtain a second state output sequence, and the positions in the second state output sequence whose state index is an integer multiple of the number of states of a single target text HMM are taken as segmentation points; the segments into which these points divide the voice feature are the segmentation units; if the current number of segmentation units is smaller than a third threshold, the current segmentation units are taken as the initial segmentation units and forced segmentation continues; if the current number of segmentation units is greater than or equal to the third threshold, the forced segmentation ends.
可选地,所述文本匹配模块,将每个所述切分单元的语音特征与所有所述目标文本模型进行匹配,得到每个所述切分单元与每个所述目标文本模型的切分单元文本匹配分数,包括:将每个所述切分单元的语音特征作为每个目标文本隐马尔可夫模型HMM的输入,将根据维特比算法获得的输出概率作为对应的切分单元文本匹配分数。Optionally, the text matching module matches the speech features of each segmentation unit against all the target text models to obtain a unit text matching score for each segmentation unit against each target text model, including: taking the speech features of each segmentation unit as the input of each target-text hidden Markov model (HMM), and taking the output probability obtained by the Viterbi algorithm as the corresponding unit text matching score.
可选地,所述认证模块,根据所述切分单元文本匹配分数、所述声纹匹配分数和预先训练的判决分类器进行身份认证,包括:取每个所述切分单元对应的所述切分单元文本匹配分数中m个最高分数对应的文本作为待选文本,若所述待选文本中包含所述切分单元对应的目标文本,则所述切分单元认证通过,计算通过的切分单元的总数,若通过的切分单元总数小于或等于第四阈值,则文本认证不通过,身份认证不通过;若通过的切分单元总数大于所述第四阈值,则所述输入语音的文本认证通过;判断所述声纹匹配分数是否大于第五阈值,如是,则声纹认证通过,身份认证通过;如不是,则将每个所述切分单元与对应目标文本模型的文本打分以及所述声纹匹配分数进行得分规整,将规整后的打分作为所述PNN分类器的输入进行身份认证。Optionally, the authentication module performs identity authentication according to the unit text matching scores, the voiceprint matching score, and a pre-trained decision classifier, including: for each segmentation unit, taking the texts corresponding to the m highest unit text matching scores as candidate texts; if the candidate texts contain the target text corresponding to that unit, the unit passes authentication. The total number of passing units is counted: if it is less than or equal to a fourth threshold, text authentication fails and identity authentication fails; if it is greater than the fourth threshold, text authentication of the input speech passes. It is then determined whether the voiceprint matching score is greater than a fifth threshold; if so, voiceprint authentication passes and identity authentication passes; if not, the text score of each segmentation unit against its corresponding target text model and the voiceprint matching score are score-regularized, and the regularized scores are used as the input of the PNN classifier for identity authentication.
本发明实施例还提供了一种计算机可读存储介质,存储有计算机可执行指令,所述计算机可执行指令用于上述的一种身份认证的方法。The embodiment of the invention further provides a computer readable storage medium storing computer executable instructions for the above method for identity authentication.
综上,本发明实施例提供一种身份认证的方法及装置,将声纹与动态密码认证两者相结合,实现了对用户进行双重验证的目的,提高了系统的安全性、可靠性和准确性。In summary, the embodiments of the present invention provide a method and device for identity authentication that combine voiceprint and dynamic password authentication, achieving dual verification of the user and improving the security, reliability and accuracy of the system.
附图说明 BRIEF DESCRIPTION OF THE DRAWINGS
图1是本发明实施例提供的一种身份认证的方法的流程图;FIG. 1 is a flowchart of a method for identity authentication according to an embodiment of the present invention;
图2是本发明实施例的训练PNN分类器的方法的流程图;2 is a flow chart of a method of training a PNN classifier according to an embodiment of the present invention;
图3是本发明实施例一的一种身份认证的方法的流程图;3 is a flowchart of a method for identity authentication according to Embodiment 1 of the present invention;
图4是本发明实施例一的语音信号初始切分的方法的流程图;4 is a flowchart of a method for initial segmentation of a speech signal according to Embodiment 1 of the present invention;
图5是本发明实施例一的声纹与文本初步认证的方法的流程图;FIG. 5 is a flowchart of a method for initial authentication of voiceprint and text according to Embodiment 1 of the present invention; FIG.
图6是本发明实施例一的得分规整的方法的流程图;6 is a flowchart of a method for score regularization according to Embodiment 1 of the present invention;
图7是本发明实施例二的一种身份认证的方法的流程图;7 is a flowchart of a method for identity authentication according to Embodiment 2 of the present invention;
图8是本发明实施例二的语音信号初始切分的方法的流程图;8 is a flowchart of a method for initial segmentation of a voice signal according to Embodiment 2 of the present invention;
图9为本发明实施例的一种身份认证的装置的示意图。FIG. 9 is a schematic diagram of an apparatus for identity authentication according to an embodiment of the present invention.
具体实施方式 DETAILED DESCRIPTION
下文中将结合附图对本发明的实施例进行详细说明。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互任意组合。Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the features in the embodiments and the embodiments in the present application may be arbitrarily combined with each other.
图1是本发明实施例提供的一种身份认证的方法的流程图,如图1所示,本实施例的方法包括以下步骤: FIG. 1 is a flowchart of a method for identity authentication according to an embodiment of the present invention. As shown in FIG. 1 , the method in this embodiment includes the following steps:
步骤11、获取输入语音的语音特征,将所述语音特征与预存的目标声纹模型进行匹配,得到声纹匹配分数;Step 11: Acquire a voice feature of the input voice, and match the voice feature with the pre-stored target voiceprint model to obtain a voiceprint matching score;
步骤12、根据所述语音特征和预设的目标文本模型对所述输入语音进行切分,获取初始切分单元以及初始语音切分单元的个数,如所述初始语音切分单元的个数小于第一阈值,则判定所述输入语音为非法语音,结束流程;如所述初始语音切分单元的个数大于或等于第一阈值,则对所述初始切分单元进行强制切分,使得切分单元的总个数与预设的目标文本的个数相同;Step 12: Segment the input speech according to the speech features and a preset target text model to obtain initial segmentation units and the number of initial speech segmentation units. If the number of initial speech segmentation units is less than a first threshold, the input speech is determined to be illegitimate and the process ends; if the number is greater than or equal to the first threshold, forced segmentation is performed on the initial segmentation units so that the total number of segmentation units equals the preset number of target texts;
步骤13、将每个所述切分单元的语音特征与所有所述目标文本模型进行匹配,得到每个所述切分单元与每个所述目标文本模型的切分单元文本匹配分数;Step 13: Matching the voice features of each of the segmentation units with all the target text models to obtain a segmentation unit text matching score of each of the segmentation units and each of the target text models;
步骤14、根据所述切分单元文本匹配分数、所述声纹匹配分数和预先训练的PNN(Probabilistic neural networks,概率神经网络)分类器进行身份认证。 Step 14. Perform identity authentication according to the segmentation unit text matching score, the voiceprint matching score, and a pre-trained PNN (Probabilistic Neural Networks) classifier.
本发明实施例提供的一种身份认证方法,将声纹与动态密码认证两者相结合,实现了对用户进行双重验证的目的,提高了系统的安全性、可靠性和准确性。An identity authentication method provided by the embodiment of the invention combines voiceprint and dynamic password authentication to achieve the purpose of double verification for the user, and improves the security, reliability and accuracy of the system.
本实施例中,需要预先对PNN分类器进行训练,根据已有的语音获取目标文本模型和目标声纹模型;将已有语音与所述目标文本模型和目标声纹模型进行匹配得到文本打分和声纹打分,根据所述声纹打分和文本打分组合成接受特征信息和拒绝特征信息,将所述接受特征信息和所述拒绝特征信息作为综合PNN判决分类器的输入进行训练,得到最终的综合判决分类器;实现方式如下:In this embodiment, the PNN classifier needs to be trained in advance: a target text model and a target voiceprint model are obtained from existing speech; the existing speech is matched against the target text model and target voiceprint model to obtain text scores and voiceprint scores; acceptance feature information and rejection feature information are composed from the voiceprint scores and text scores; and the acceptance feature information and rejection feature information are used as the input for training the integrated PNN decision classifier, yielding the final integrated decision classifier. This is implemented as follows:
将目标语音与所述目标文本模型和目标声纹模型进行匹配分别得到第一文本打分和第一声纹打分,将所述第一文本打分和第一声纹打分组合成为所述判决分类器的接受特征信息;Matching the target speech with the target text model and the target voiceprint model to obtain a first text score and a first voice score, respectively, and combining the first text score and the first voice score into the decision classifier. Accept feature information;
将非目标语音与所述目标文本模型和目标声纹模型进行匹配分别得到第二文本打分和第二声纹打分,将所述第二文本打分和第二声纹打分组合成为所述判决分类器的拒绝特征信息; Matching the non-target speech with the target text model and the target voiceprint model to obtain a second text score and a second voice score, respectively, and combining the second text score and the second voice score into the decision classifier Rejection characteristic information;
根据所述接受特征信息和所述拒绝特征信息对所述PNN分类器进行训练。The PNN classifier is trained according to the acceptance feature information and the rejection feature information.
所述目标语音为所述目标话者读取所述目标文本的语音,所述非目标语音为所述目标话者读取非目标文本的语音以及非目标话者的语音。The target voice is a voice of the target speaker reading the target text, and the non-target voice is a voice of the target speaker reading non-target text and a voice of a non-target speaker.
可选地,在训练所述综合分类器之前对所述声纹打分和文本打分进行得分规整,例如包括以下步骤:Optionally, the voice score and the text score are scored before the training of the integrated classifier, for example, including the following steps:
a.依次选取目标文本模型,取非目标文本语音特征与该目标文本模型匹配,得到冒认文本打分;a. Select the target text model in turn, and take the non-target text speech feature to match the target text model, and obtain the fake text score;
b.求所述目标文本模型对应的所述冒认文本打分均值及标准差;b. Find the mean value and standard deviation of the falsified text corresponding to the target text model;
c.将所述第一文本打分和所述第二文本打分分别减去对应的所述冒认文本打分的均值且除以所述标准差,分别得到规整后的文本打分;c. Subtract the corresponding impostor text score mean from the first text scores and the second text scores respectively and divide by the standard deviation, to obtain the regularized text scores;
d.合并所述声纹打分和规整后的文本打分,求得每一目标文本对应的最大值和最小值,利用步骤d中的所述最大值和最小值将所述声纹打分和文本打分进行归一化;例如:d. Combining the voice scores and the regularized text scores, obtaining the maximum and minimum values corresponding to each target text, and using the maximum and minimum values in step d to score the voiceprint scores and texts. Normalize; for example:
合并规整后的第一文本打分和所述第一声纹打分,获取每一目标文本对应的最大值和最小值;利用该最大值和最小值将规整后的第一文本打分和所述第一声纹打分进行归一化,作为所述PNN分类器的接受特征信息;Combining the normalized first text score and the first voiceprint score, obtaining a maximum value and a minimum value corresponding to each target text; using the maximum value and the minimum value to score the normalized first text and the first The voiceprint score is normalized as the acceptance feature information of the PNN classifier;
合并规整后的第二文本打分和所述第二声纹打分,获取每一目标文本对应的最大值和最小值;利用该最大值和最小值将规整后的第二文本打分和所述第二声纹打分进行归一化,作为所述PNN分类器的拒绝特征信息。Combining the normalized second text score and the second voiceprint score, obtaining a maximum value and a minimum value corresponding to each target text; using the maximum value and the minimum value to score the regular second text and the second The voiceprint score is normalized as the rejection feature information of the PNN classifier.
本实施例中为方便描述,做以下定义:For convenience of description in this embodiment, the following definitions are made:
目标文本:事先选定作为备选密码的文本,如0~9数字;Target text: text selected as an alternate password in advance, such as 0-9 digits;
目标话者:系统受信任的话者,在声纹认证时需要让其通过的话者;Target speaker: A person who is trusted by the system, who needs to pass the voiceprint authentication;
冒认话者:系统非受信任话者,在声纹认证时需要拒绝其进入的话者;Impostor speaker: a speaker not trusted by the system, who must be rejected during voiceprint authentication;
目标密码:系统受信任的目标文本组合,在文本认证时需要让其通过;Target password: A combination of the target texts trusted by the system, which needs to be passed when the text is authenticated;
冒认密码:系统不受信任的文本组合,在文本认证时需要拒绝其进入的文本。 False password: A system-untrusted combination of text that needs to be rejected for text authentication.
系统进行认证之前,需要选择目标文本集,并针对目标文本集中的每个目标文本进行训练,得到目标文本模型集。以下实施例目标文本集选择为:0~9十个数字,目标模型集由0~9十个数字训练出来的模型组成,目标模型种类可以为HMM(Hidden Markov Model,隐马尔可夫模型)。为方便描述,动态密码均由0~9十个数字中的8个组成,即系统选择8个目标文本,作为目标密码。同时在系统进行认证之前,需要注册目标话者的声纹信息,通过训练生成声纹模型,并通过声纹模型和目标模型训练综合判决分类器,如图2所示包括如下步骤:Before the system performs authentication, it is necessary to select the target text set and train each target text in the target text set to obtain the target text model set. In the following embodiment, the target text set is selected as: 0 to 9 ten numbers, and the target model set is composed of 0 to 9 ten numbers trained models, and the target model type may be HMM (Hidden Markov Model). For convenience of description, the dynamic password is composed of 8 out of 0 to 9 ten digits, that is, the system selects 8 target texts as the target password. At the same time, before the system is authenticated, it is necessary to register the voiceprint information of the target speaker, generate a voiceprint model through training, and train the comprehensive decision classifier through the voiceprint model and the target model, as shown in FIG. 2, including the following steps:
步骤001:训练目标文本模型:使用0~9的数字录音训练单个数字的HMM,每个数字的模型称为目标文本模型,训练方法可使用现有的训练方法;Step 001: Training target text model: training a single digital HMM using digital recordings of 0-9, each digital model is called a target text model, and the training method can use an existing training method;
HMM是一个双重随机过程,一个过程用来描述短时平稳信号的时变性,另一个过程用来描述HMM模型的状态数与特征序列之间的对应关系。两个过程相互作用,不仅能够描述语音信号的动态特性,而且可以解决短时平稳信号之间的过渡问题。HMM is a double stochastic process, one process is used to describe the time-varying of the short-term stationary signal, and the other process is used to describe the correspondence between the state number of the HMM model and the feature sequence. The interaction of the two processes not only describes the dynamic characteristics of the speech signal, but also solves the transition problem between short-term stationary signals.
步骤002:注册目标话者声纹模型:系统在使用之前,事先注册目标话者声纹模型,目标话者即为系统受信任的话者,在认证时需要让其通过;Step 002: Register the target voiceprint model: Before the system is used, register the target voiceprint model in advance, and the target speaker is the system trusted speaker, and needs to pass it when authenticating;
步骤003:求接受特征:使用目标话者的目标文本对应的语音与其对应的HMM进行匹配,得到目标文本接受打分;使用目标话者的目标文本对应的语音与目标话者声纹模型进行打分,得到目标话者声纹接受打分;一系列的目标话者声纹接受打分和目标文本接受打分组成综合分类器的接受特征,对应综合分类器输出为1;Step 003: Compute acceptance features: the target speaker's speech for each target text is matched against the corresponding HMM to obtain target text acceptance scores; the same speech is scored against the target speaker's voiceprint model to obtain target voiceprint acceptance scores. A series of target voiceprint acceptance scores and target text acceptance scores form the acceptance features of the integrated classifier, with corresponding classifier output 1;
步骤004:求拒绝特征:使用目标文本对应的语音与非对应的HMM模型进行匹配,得到冒认文本的拒绝打分;使用冒认话者与目标声纹模型进行打分,得到冒认声纹拒绝打分,由一系列的冒认文本拒绝打分和冒认声纹拒绝打分组成综合分类器的拒绝特征,对应综合分类器输出为0;Step 004: Compute rejection features: speech for the target texts is matched against non-corresponding HMM models to obtain impostor text rejection scores; impostor speakers' speech is scored against the target voiceprint model to obtain impostor voiceprint rejection scores. A series of impostor text rejection scores and impostor voiceprint rejection scores form the rejection features of the integrated classifier, with corresponding classifier output 0;
步骤005:训练分类器:合并综合分类器的接受特征和拒绝特征,将合并后的特征进行得分规整(详见步骤109)后作为分类器的训练输入, 根据现有训练算法(如梯度下降算法)可得到综合分类器。Step 005: Train the classifier: combine the acceptance feature and the rejection feature of the integrated classifier, and perform the score regularization (see step 109) as the training input of the classifier. A comprehensive classifier can be obtained according to an existing training algorithm such as a gradient descent algorithm.
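As a rough illustration of how the trained integrated PNN classifier could produce the 1/0 decision described in steps 003-005, the sketch below implements a minimal Parzen-window PNN in Python. The function name, the kernel width `sigma`, and the toy feature dimension are illustrative assumptions, not details given by the patent:

```python
import math

def pnn_classify(x, accept_patterns, reject_patterns, sigma=0.1):
    """Parzen-window PNN decision: estimate each class's density as the
    mean of Gaussian kernels centred on that class's training patterns
    (the merged acceptance / rejection features), then pick the class
    with the larger density.  Returns 1 (accept) or 0 (reject)."""
    def class_density(patterns):
        total = 0.0
        for p in patterns:
            sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma * sigma))
        return total / len(patterns)

    return 1 if class_density(accept_patterns) >= class_density(reject_patterns) else 0
```

A toy 9-dimensional feature vector close to the acceptance patterns would classify as 1, and one close to the rejection patterns as 0.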
实施例一:Embodiment 1:
如图3所示,包括以下步骤:As shown in Figure 3, the following steps are included:
步骤101、预处理:根据短时能量和短时过零率,对用户输入的测试语音进行预处理,去掉语音中的非语音段;Step 101: Pre-processing: pre-processing the test voice input by the user according to the short-time energy and the short-time zero-crossing rate, and removing the non-speech segment in the voice;
步骤102、特征参数提取:对预处理后的测试语音进行特征参数提取,该系统可以采用12维梅尔频域倒谱系数(Mel Frequency Cepstrum Coefficient,简称MFCC)和其一阶差分系数作为特征参数,共24维;Step 102: Feature parameter extraction: feature parameters are extracted from the pre-processed test speech; the system may use 12-dimensional Mel-frequency cepstral coefficients (MFCC) and their first-order difference coefficients as feature parameters, 24 dimensions in total;
步骤103、计算声纹匹配分数:将测试语音特征与目标话者的声纹模型进行匹配,得到声纹匹配分数;Step 103: Calculating a voiceprint matching score: matching the test voice feature with the voiceprint model of the target speaker to obtain a voiceprint matching score;
步骤104、对语音特征初始切分:通过对测试语音特征的初始切分,获得初始切分单元以及初始切分单元个数。Step 104: Initially segment the speech feature: obtain an initial segmentation unit and an initial segmentation unit number by initial segmentation of the test speech feature.
本实施例中,根据目标密码中的目标文本序列,将对应的目标文本HMM组合成复合HMM;In this embodiment, the corresponding target text HMM is combined into a composite HMM according to the target text sequence in the target password;
将所述语音特征作为所述复合HMM的输入进行Viterbi(维特比)解码,得到第一状态输出序列,将所述第一状态输出序列中为单个目标文本HMM的状态数的整数倍的状态对应的位置作为初始切分点;The speech features are taken as the input of the composite HMM for Viterbi decoding to obtain a first state output sequence; the positions in the first state output sequence corresponding to states whose index is an integer multiple of the number of states of a single target-text HMM are taken as initial segmentation points;
依次选取所述相邻两个初始切分点作为区间起止点,在所述区间内,以指定帧为单位计算平均能量,寻找平均能量连续指定次增大的点,并将开始增大的点作为新的初始切分点,否则,不更新初始切分点,由所述初始切分点分割成的所述初始切分单元。Adjacent pairs of initial segmentation points are selected in turn as the start and end of an interval; within the interval, the average energy is computed in units of a specified number of frames, and a point at which the average energy increases a specified number of consecutive times is sought; the point where the increase begins is taken as a new initial segmentation point; otherwise, the initial segmentation point is not updated. The units into which the initial segmentation points divide the speech features are the initial segmentation units.
其中,所述复合HMM的状态数为单个目标文本HMM的状态数总和;所述复合HMM的每个状态具有的高斯混合模型参数与所述单个目标文本HMM的每个状态具有的高斯混合模型参数相同,The state number of the composite HMM is a sum of state numbers of a single target text HMM; each state of the composite HMM has a Gaussian mixture model parameter and a Gaussian mixture model parameter of each state of the single target text HMM the same,
将所述单个目标文本HMM的状态转移矩阵中的最后一个状态自身转移概率设为0,转移到下一个状态的状态转移概率设为1;所述目标文本的最后一个单个目标文本HMM的状态转移概率矩阵不作改变;The last state self transition probability in the state transition matrix of the single target text HMM is set to 0, the state transition probability transferred to the next state is set to 1; the state transition of the last single target text HMM of the target text The probability matrix is not changed;
将所述单个目标文本HMM的状态转移概率矩阵按照所述目标文本的 单个目标文本排列顺序合并,得到所述复合HMM的状态转移概率矩阵。Determining a state transition probability matrix of the single target text HMM according to the target text A single target text arrangement order is merged to obtain a state transition probability matrix of the composite HMM.
对语音特征初始切分的方法如图4所示,包括步骤如下:The method for initial segmentation of speech features is shown in Figure 4, including the following steps:
步骤104a、复合HMM模型的组合:按照目标密码中目标文本序列,将对应的单个目标文本HMM组合为复合HMM模型。 Step 104a: Combination of composite HMM models: Combine the corresponding single target texts HMM into a composite HMM model according to the target text sequence in the target password.
假设每个数字的HMM模型有8个状态数,每个状态由3个高斯函数拟合,那么,复合HMM模型的状态数为单个目标文本HMM模型状态数之和,每个状态仍由3个高斯函数拟合,且其高斯混合模型参数与单个HMM模型每个状态的高斯混合模型参数相同,复合HMM的状态转移概率矩阵参数的变化以3个单个目标文本HMM模型连接成一个复合型HMM为例进行说明,该例中单个目标文本HMM模型状态数为3,如下式所示:Suppose each digit's HMM has 8 states and each state is fitted by 3 Gaussian functions. Then the number of states of the composite HMM equals the sum of the numbers of states of the single target-text HMMs; each state is still fitted by 3 Gaussian functions, and its Gaussian mixture parameters are identical to those of the corresponding state of the single HMM. The change in the composite HMM's state transition probability matrix is illustrated by concatenating three single target-text HMMs into one composite HMM; in this example each single target-text HMM has 3 states, as shown below:
$$A_k=\begin{pmatrix}a^{(k)}_{11}&a^{(k)}_{12}&0\\0&a^{(k)}_{22}&a^{(k)}_{23}\\0&0&a^{(k)}_{33}\end{pmatrix},\qquad k=1,2,3$$
组合成复合HMM模型时,每个状态矩阵将改写成如下形式:When synthesizing a composite HMM model, each state matrix is rewritten as follows:
$$A'_k=\begin{pmatrix}a^{(k)}_{11}&a^{(k)}_{12}&0\\0&a^{(k)}_{22}&a^{(k)}_{23}\\0&0&0\end{pmatrix},\qquad k=1,2\quad\text{(the last model's matrix }A_3\text{ is unchanged)}$$
于是复合HMM模型的状态转移概率矩阵为:Then the state transition probability matrix of the composite HMM model is:
$$A=\begin{pmatrix}
a^{(1)}_{11}&a^{(1)}_{12}&0&0&0&0&0&0&0\\
0&a^{(1)}_{22}&a^{(1)}_{23}&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0&0\\
0&0&0&a^{(2)}_{11}&a^{(2)}_{12}&0&0&0&0\\
0&0&0&0&a^{(2)}_{22}&a^{(2)}_{23}&0&0&0\\
0&0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&a^{(3)}_{11}&a^{(3)}_{12}&0\\
0&0&0&0&0&0&0&a^{(3)}_{22}&a^{(3)}_{23}\\
0&0&0&0&0&0&0&0&a^{(3)}_{33}
\end{pmatrix}$$
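The block construction of the composite state transition matrix described above can be sketched as follows; the helper name and the plain nested-list matrix representation are illustrative assumptions:

```python
def composite_transition_matrix(mats):
    """Stack per-digit left-to-right HMM transition matrices into one
    composite matrix: for every model except the last, the final
    state's self-transition is set to 0 and its transition into the
    first state of the next model is set to 1; the last model's
    matrix is left unchanged."""
    n = sum(len(a) for a in mats)
    comp = [[0.0] * n for _ in range(n)]
    offset = 0
    for k, a in enumerate(mats):
        s = len(a)
        for i in range(s):
            for j in range(s):
                comp[offset + i][offset + j] = a[i][j]
        if k != len(mats) - 1:
            comp[offset + s - 1][offset + s - 1] = 0.0  # kill self-loop
            comp[offset + s - 1][offset + s] = 1.0      # jump to next model
        offset += s
    return comp
```

For three 3-state models this yields the 9×9 matrix shown above.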
步骤104b、Viterbi(维特比)解码:利用Viterbi解码将步骤102中得到的特征序列与步骤104a中得到的复合HMM模型匹配,得到一个最佳状态输出序列,使每一帧特征都有其对应的状态; Step 104b, Viterbi decoding: matching the feature sequence obtained in step 102 with the composite HMM model obtained in step 104a by Viterbi decoding to obtain an optimal state output sequence, so that each frame feature has its corresponding status;
步骤104c、寻找初始切分点:由步骤104a可知单个数字HMM模型的状态数为8,在步骤104b中所得最佳状态输出序列中寻找对应状态为8的整数倍的位置作为初始切分点P(i); Step 104c: Find an initial segmentation point: it can be known from step 104a that the number of states of the single digital HMM model is 8, and the position of the optimal state output sequence obtained in step 104b is found as the initial segmentation point P. (i);
步骤104d、更新初始切分点:依次选取步骤104c中相邻的两个初始切分点P(i-1)和P(i),并分别作为区间的起始点和终止点。在该区间内,每K帧组成一段,共L段,每段平均能量为E(n),n为段索引号,计算S(n-1)=E(n)-E(n-1)n=2…L,从S(n1)>0,n1=1…L-1的索引号开始向后搜索,若S(n1+1),S(n1+2),……,S(n1+q)均大于0,其中q是一个大于1的常数,则将n1段的起始点作为新的初始切分点代替P(i-1);若无该类索引号,则不更新初始切分点。由初始切分点分割成的不同单元即初始切分单元,假设初始切分单元个数为M,由于最佳状态序列的最大状态为64,所以初始切分单元个数小于等于8个(该更新过程并未改变初始切分点个数); Step 104d: Update the initial segmentation point: sequentially select two adjacent initial segmentation points P(i-1) and P(i) in step 104c, and respectively serve as a starting point and a termination point of the interval. In this interval, each K frame is composed of a segment, a total of L segments, the average energy of each segment is E(n), n is the segment index number, and S(n-1)=E(n)-E(n-1) is calculated. n=2...L, search backward from the index number of S(n1)>0, n1=1...L-1, if S(n1+1), S(n1+2),...,S(n1 +q) is greater than 0, where q is a constant greater than 1, then the starting point of the n1 segment is replaced by P(i-1) as the new initial segmentation point; if there is no such index number, the initial slice is not updated. Points. The initial segmentation unit is divided into different units, that is, the initial segmentation unit, assuming that the number of initial segmentation units is M, since the maximum state of the optimal state sequence is 64, the number of initial segmentation units is less than or equal to 8 (the The update process does not change the number of initial segmentation points);
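Under one reading of step 104d, the energy-based update of an initial segmentation point might be sketched as below. The per-frame energy representation, the choice of which segment boundary becomes the new point, and the defaults for K and q are assumptions:

```python
def update_split_point(frame_energy, start, end, K=5, q=2):
    """Within the interval [start, end), group frames into K-frame
    segments with average energies E(n), compute the differences
    S(n) = E(n+1) - E(n), and move the split point to the start of the
    first segment that begins a run of q+1 consecutive increases.
    Returns the original `start` if no such run exists."""
    segs = [sum(frame_energy[i:i + K]) / K
            for i in range(start, end - K + 1, K)]
    diffs = [segs[n] - segs[n - 1] for n in range(1, len(segs))]
    for n1 in range(len(diffs) - q):
        # S(n1) > 0 followed by q further positive differences
        if all(d > 0 for d in diffs[n1:n1 + q + 1]):
            return start + n1 * K  # frame index where the rise begins
    return start
```

With a flat-then-rising energy profile, the split point moves to where the rise starts; with flat energy it is left unchanged.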
步骤105、初始切分单元个数判决:步骤104将语音切分后得到若干 个初始切分单元,对于目标密码语音,其初始切分单元个数一般近似等于目标密码中目标文本个数;对于冒认密码语音,其切分单元个数往往远小于目标密码中目标文本个数。由步骤104可知测试语音初始切分单元数为M,假设最少切分单元个数为T,当M<T时,系统直接拒绝该请求人,判决结束,否则,执行步骤106;Step 105: Initial segmentation unit number decision: Step 104: After segmenting the speech, a number of segments are obtained. The initial segmentation unit, for the target cipher voice, the number of initial segmentation units is generally equal to the target text number in the target password; for the cryptographic voice, the number of segmentation units is often much smaller than the target text in the target password. number. It can be known from step 104 that the initial number of test speech units is M, assuming that the minimum number of split units is T, when M < T, the system directly rejects the requester, the decision ends, otherwise, step 106 is performed;
步骤106、强制切分:当8-M>0时,取初始切分单元中对应特征段最长的切分单元,并将该特征段平均切分为(8-M+1)份,强制切分后的切分单元总数变为8;Step 106: Forced segmentation: when 8-M>0, take the longest segmentation unit of the corresponding feature segment in the initial segmentation unit, and divide the feature segment into (8-M+1) portions, forcibly The total number of segmentation units after segmentation becomes 8;
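Step 106's forced segmentation of the longest unit might be sketched as follows, representing each unit as a (start, end) frame range; the function name and the rounding of piece boundaries are assumptions:

```python
def force_split(units, target_count=8):
    """If there are fewer than target_count units, cut the longest one
    into (target_count - M + 1) equal parts so the total becomes
    exactly target_count (M is the current number of units)."""
    if len(units) >= target_count:
        return list(units)
    units = list(units)
    k = max(range(len(units)), key=lambda i: units[i][1] - units[i][0])
    start, end = units[k]
    parts = target_count - len(units) + 1      # (8 - M + 1) pieces
    step = (end - start) / parts
    pieces = [(round(start + i * step), round(start + (i + 1) * step))
              for i in range(parts)]
    return units[:k] + pieces + units[k + 1:]
```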
步骤107、计算文本匹配分数:将步骤106中得到的切分单元对应特征序列与0~9十个目标文本的目标模型HMM进行匹配,每个切分单元对应10个匹配打分,假设该打分为word_score(i,j),该变量表示动态密码中第i个切分单元与数字j的模型的文本匹配分数;Step 107: Calculate the text matching score: match the corresponding feature sequence of the segmentation unit obtained in step 106 with the target model HMM of 0 to 9 target texts, and each segmentation unit corresponds to 10 matching scores, assuming the score is divided. Word_score(i,j), the variable represents the text matching score of the model of the i-th segmentation unit and the number j in the dynamic password;
步骤108、声纹与文本初步认证: Step 108, voiceprint and text preliminary certification:
取每个所述切分单元对应的所述切分单元文本匹配分数中m个最高分数对应的文本作为待选文本,若所述待选文本中包含所述切分单元对应的目标文本,则所述切分单元认证通过,计算通过的切分单元的总数,若通过的切分单元总数小于或等于第四阈值,则文本认证不通过,身份认证不通过,判决结束;若通过的切分单元总数大于所述第四阈值,则所述输入语音的文本认证通过;Taking the text corresponding to the m highest scores in the segment matching unit text matching score corresponding to each of the segmentation units as the candidate text, and if the candidate text includes the target text corresponding to the segmentation unit, The segmentation unit passes the authentication, and calculates the total number of the segmentation units that pass. If the total number of the segmentation units passed is less than or equal to the fourth threshold, the text authentication fails, the identity authentication fails, and the decision ends; if the segmentation is passed If the total number of units is greater than the fourth threshold, the text authentication of the input voice passes;
判断所述声纹匹配分数是否大于第五阈值,如是,则声纹认证通过,身份认证通过,判决结束;如不是,则将每个所述切分单元与对应目标文本模型的文本打分以及所述声纹匹配分数进行得分规整,将规整后的打分作为所述判决分类器的输入进行身份认证。Determining whether the voiceprint matching score is greater than a fifth threshold, if yes, the voiceprint authentication is passed, the identity authentication is passed, and the decision is ended; if not, the text of each of the segmentation units and the corresponding target text model is scored and The voiceprint matching score is scored, and the regularized score is used as the input of the decision classifier for identity authentication.
如图5所示,其实施方法如下:As shown in Figure 5, the implementation method is as follows:
步骤108a、每个切分单元各取m个最高得分:由上述步骤106可知,每个切分单元对应有10个得分,各取m(一般为2或3)个最高打分,分别对应m个待匹配文本; Step 108a: Each of the segmentation units takes m highest scores: as shown in step 106 above, each segmentation unit has 10 scores, each taking m (generally 2 or 3) highest scores, corresponding to m The text to be matched;
步骤108b、切分单元文本认证:对每个切分单元进行文本认证,若切分单元对应的m个待匹配文本中包含该切分单元对应的目标文本,则该切分单元的文本认证通过,反之,认证不通过;Step 108b: Segmentation unit text authentication: text authentication is performed for each segmentation unit; if the m candidate texts of a segmentation unit contain the target text corresponding to that unit, the text authentication of that unit passes; otherwise, it fails;
步骤108d、测试语音文本认证:假设测试语音切分单元文本认证通过的最小数为p,当W大于p时,则判定该语音文本认证通过,并转至步骤108e,否则,文本认证不通过,身份认证不通过,判决结束; Step 108d: Test voice text authentication: Assume that the minimum number of text authentication passes by the test voice segmentation unit is p. When W is greater than p, it is determined that the voice text authentication passes, and the process proceeds to step 108e. Otherwise, the text authentication fails. The identity authentication fails and the judgment ends;
步骤108e、测试语音声纹认证:设置一个较大的声纹阈值,以保证系统的严格性,当声纹匹配分数大于阈值时,声纹认证通过,该测试语音身份认证通过,否则,转至步骤109; Step 108e: testing the voice voiceprint authentication: setting a larger voiceprint threshold to ensure the strictness of the system. When the voiceprint matching score is greater than the threshold, the voiceprint authentication is passed, and the test voice identity authentication is passed, otherwise, the process proceeds to Step 109;
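The three-way decision of steps 108a-108e (reject on text, accept on a strict voiceprint threshold, otherwise fall through to the classifier) can be sketched as below; the parameter defaults m, p and the voiceprint threshold are placeholders, not values fixed by the patent:

```python
def preliminary_decision(unit_scores, target_digits, voice_score,
                         m=2, p=5, voice_threshold=0.9):
    """unit_scores: for each segmentation unit, a list of 10 text-match
    scores (index = digit 0-9).  target_digits: the 8 digits of the
    dynamic password.  Returns 'accept', 'reject', or 'classifier'
    (fall through to the PNN stage)."""
    passed = 0
    for scores, target in zip(unit_scores, target_digits):
        best_m = sorted(range(10), key=lambda d: scores[d], reverse=True)[:m]
        if target in best_m:
            passed += 1
    if passed <= p:
        return 'reject'       # text authentication failed
    if voice_score > voice_threshold:
        return 'accept'       # voiceprint clears the strict threshold
    return 'classifier'       # defer to the integrated classifier
```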
步骤109、得分规整:首先求得大量冒认密码语音对应目标文本模型的打分均值与方差,在得到测试语音中每个切分单元对应的文本打分后减去冒认得分均值并除以标准差。如图6所示,其实施方法如下:Step 109: Score regularization: first, the mean and standard deviation of the scores of a large number of impostor password utterances against the target text models are computed; then, from the text score of each segmentation unit of the test speech, the impostor score mean is subtracted and the result is divided by the standard deviation. As shown in Figure 6, this is implemented as follows:
步骤109a、求大量冒认文本打分:依次取0~9的单个数字模型HMM,假设取数字l的模型HMMl,根据Viterbi算法,取大量非l的冒认语音特征作为模型HMMl的输入,得到大量冒认文本打分; Step 109a, seeking a large number of spoofed text scores: taking a single digital model HMM of 0-9 in turn, assuming that the model HMM l of the number l is taken, according to the Viterbi algorithm, taking a large number of non-l spoofed speech features as input of the model HMM l , Get a lot of fake text scores;
步骤109b、求均值与标准差:计算每个文本对应的冒认文本打分均值与标准差; Step 109b: Find the mean value and the standard deviation: calculate the average value and the standard deviation of the text of the text corresponding to each text;
步骤109c、零归整及归一化:在步骤107计算文本匹配分数的基础上,找出每个切分单元与其对应目标文本模型的打分,此时每个切分单元对应一个文本打分。根据零归整方法,将每个文本打分分别减去对应文本的冒认打分均值并除以标准差,得到规整后的文本匹配分数,将步骤103中得到的声纹匹配分数与规整后的8个文本匹配分数合并组成一个9维的特征向量score(得分)。由于该特征向量中的声纹匹配分数不论是目标话者还是冒认话者的声纹打分,其打分一般远大于文本匹配分数,因此,又对特征向量增加了归一化处理,使得声纹匹配分数与文本匹配分数均在[0,1]之间。假设该特征向量的最大值和最小值分别为max_score和min_score,对特征向量作线性变换,得到一个新的特征向量new_score=(score-min_score)/(max_score-min_score); Step 109c: Zero-normalization and normalization: based on the text matching scores computed in step 107, the score of each segmentation unit against its corresponding target text model is found, so that each unit corresponds to one text score. Following the zero-normalization method, the impostor score mean of the corresponding text is subtracted from each text score and the result divided by the standard deviation, yielding regularized text matching scores. The voiceprint matching score obtained in step 103 and the 8 regularized text matching scores are merged into a 9-dimensional feature vector score. Because the voiceprint matching score in this vector, whether from the target speaker or an impostor, is generally far larger than the text matching scores, a further normalization step is applied so that both the voiceprint matching score and the text matching scores lie in [0, 1]. Let the maximum and minimum of the feature vector be max_score and min_score; a linear transform of the feature vector gives a new feature vector new_score = (score - min_score)/(max_score - min_score);
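A compact sketch of step 109c's zero-normalization followed by min-max scaling, under the assumption that the impostor statistics are stored in lists indexed by digit:

```python
def regularize(text_scores, targets, impostor_mean, impostor_std, voice_score):
    """Zero-normalize each unit's score against its target digit's
    impostor statistics, append the voiceprint score, then min-max
    normalize the resulting 9-dimensional vector into [0, 1]."""
    norm = [(s - impostor_mean[d]) / impostor_std[d]
            for s, d in zip(text_scores, targets)]
    score = norm + [voice_score]            # the 9-dim vector `score`
    lo, hi = min(score), max(score)
    return [(v - lo) / (hi - lo) for v in score]
```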
步骤110综合判决:利用综合判决分类器对输入特征向量new_score进行判决,对于每一个输入,其输出为1或0,当输出为1时表示测试语音判决通过,输出为0时拒绝测试语音通过。Step 110: The decision is made by using the integrated decision classifier to determine the input feature vector new_score. For each input, the output is 1 or 0. When the output is 1, the test voice decision is passed, and when the output is 0, the test voice is rejected.
实施例二:Embodiment 2:
For the initial segmentation of the speech features in step 104, the decision on the number of segmentation units in step 105, and the forced segmentation in step 106 of the first embodiment, this embodiment performs segmentation and decision as follows:

Step 201, initial segmentation of the speech signal.

In this embodiment, the initial segmentation units are split in descending order of length; each time, one initial segmentation unit is split evenly into two segments, until the total number of segmentation units equals the number of target texts.

If the number of forced splits is greater than or equal to the second threshold, the forced segmentation ends. If the number of forced splits is less than the second threshold, each current segmentation unit is scored against every target-text HMM, the target-text HMM with the highest score is selected for each unit, and the selected target-text HMMs are concatenated into a second composite HMM.

The speech features are used as input to the second composite HMM for Viterbi decoding, producing a second state output sequence. The positions in this sequence corresponding to states whose index is an integer multiple of the number of states of a single target-text HMM are taken as segmentation points, and the portions of the speech features delimited by these points are the segmentation units. If the current number of segmentation units is less than the third threshold, the current segmentation units are taken as the initial segmentation units and forced segmentation continues; if the current number is greater than or equal to the third threshold, forced segmentation ends and the resulting units are the final segmentation units. As shown in Figure 8, the procedure includes the following steps:
Step 201a, initial segmentation: compute the envelope of the speech signal and select the regions around the 8 largest envelope maxima as the initial segmentation result.

Step 201b, decision on the initial segments by score: score each segment against the ten digit models 0-9; for each segment, take the digit with the highest score as that segment's decision result.

Step 201c, composition of the composite HMM: according to the segmentation decisions of step 201b, select the corresponding HMMs and concatenate them into a composite HMM; for the composition procedure, see step 104a of the first embodiment.

Step 201d, further segmentation by Viterbi decoding: perform Viterbi decoding of the input signal against the combined model output by step 201c, and further segment the signal according to the optimal state sequence; for the segmentation procedure, see step 104c of the first embodiment.

Step 202, forced segmentation: sort the segments by length and, in descending order of length, split one segment evenly into two at a time, until there are 8 segments.

Step 203, initial segmentation decision: if the number of segments from step 201d is less than X (corresponding to the third threshold, X < 8), return to step 201b, using the output of step 202 as the input of step 201b, and continue segmenting; if the number of segments is greater than or equal to X, segmentation ends. A maximum iteration count D (corresponding to the second threshold) is set: if after D iterations the number of segments from step 201b is still less than X, the iteration stops and the speech is rejected; if within D iterations the number of segments reaches X or more, the decision continues with step 107 and the subsequent steps of the first embodiment.
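The iteration control of steps 201b-203 (decode, force-split, retry up to D times) can be sketched as follows. This is one possible reading of the loop, with `decode_fn` and `force_split_fn` as hypothetical callbacks standing in for the Viterbi-based segmentation of steps 201b-201d and the halving of step 202:

```python
def segment_with_retry(decode_fn, force_split_fn, signal, X=6, D=3):
    """Iteration control of steps 201b-203: decode once, and while fewer
    than X segments are found, force-split to 8 segments (step 202) and
    re-decode (steps 201b-201d), rejecting after D failed iterations."""
    segments = decode_fn(signal)
    iterations = 0
    while len(segments) < X:
        if iterations >= D:
            return None                 # reject the speech
        segments = decode_fn(force_split_fn(segments, 8))
        iterations += 1
    return segments                     # proceed to step 107
```

With a decoder that keeps finding too few segments, the loop gives up after D rounds instead of iterating forever.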
Figure 9 is a schematic diagram of an identity authentication apparatus according to an embodiment of the present invention. The apparatus of this embodiment includes a PNN classifier and, as shown in Figure 9, comprises:

a voiceprint matching module, configured to acquire the speech features of an input speech and match the speech features against a pre-stored target voiceprint model to obtain a voiceprint matching score;

a segmentation module, configured to segment the input speech according to the speech features and preset target-text models and obtain initial segmentation units and the number of initial segmentation units; if the number of initial segmentation units is less than a first threshold, the input speech is judged to be illegitimate speech; if the number of initial segmentation units is greater than or equal to the first threshold, forced segmentation is applied to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts;

a text matching module, configured to match the speech features of each segmentation unit against all the target-text models to obtain a segmentation-unit text matching score for each segmentation unit against each target-text model;

an authentication module, configured to perform identity authentication according to the segmentation-unit text matching scores, the voiceprint matching score, and the pre-trained PNN classifier.
In an optional embodiment, the apparatus further includes a processing module, wherein:

the voiceprint matching module is configured to match target speech against the target voiceprint model to obtain a first voiceprint score, and to match non-target speech against the target voiceprint model to obtain a second voiceprint score;

the text matching module is configured to match the target speech against the target-text models to obtain a first text score, and to match the non-target speech against the target-text models to obtain a second text score;

the processing module is configured to combine the first text score and the first voiceprint score into acceptance feature information for the PNN classifier, and to combine the second text score and the second voiceprint score into rejection feature information for the PNN classifier;

the PNN classifier is trained according to the acceptance feature information and the rejection feature information.
In an optional embodiment, the processing module is further configured to: select the target-text models in turn, match the speech features of non-target texts against the corresponding target-text model to obtain impostor text scores, and obtain the mean and standard deviation of the impostor text scores corresponding to each target-text model; subtract the corresponding mean of the impostor text scores from the first text score and the second text score respectively and divide by the standard deviation, obtaining normalized text scores; combine the normalized first text scores with the first voiceprint score and obtain the maximum and minimum corresponding to each target text, then normalize the normalized first text scores and the first voiceprint score with this maximum and minimum, as the acceptance feature information of the PNN classifier; combine the normalized second text scores with the second voiceprint score and obtain the maximum and minimum corresponding to each target text, then normalize the normalized second text scores and the second voiceprint score with this maximum and minimum, as the rejection feature information of the PNN classifier.
In an optional embodiment, the segmentation module segments the input speech according to the speech features and the preset target-text models and obtains the initial segmentation units by: concatenating, according to the target-text sequence in the target password, the corresponding target-text hidden Markov models (HMMs) into a first composite HMM; performing Viterbi decoding with the speech features as input to the first composite HMM to obtain a first state output sequence, and taking as initial segmentation points the positions in the first state output sequence corresponding to states whose index is an integer multiple of the number of states of a single target-text HMM; then taking each pair of adjacent initial segmentation points in turn as the start and end of an interval, computing the average energy within the interval in units of a specified number of frames, finding the point at which the average energy increases a specified number of consecutive times, and taking the point where the increase begins as a new initial segmentation point; the initial segmentation units are the portions delimited by the initial segmentation points.
In an optional embodiment, the segmentation module concatenates the corresponding target-text HMMs into the first composite HMM as follows: the number of states of the first composite HMM is the sum of the numbers of states of the individual target-text HMMs; each state of the first composite HMM has the same Gaussian mixture model parameters as the corresponding state of the individual target-text HMM; in the state transition matrix of each individual target-text HMM, the self-transition probability of the last state is set to 0 and its probability of transitioning to the next state is set to 1, except that the state transition probability matrix of the last individual target-text HMM of the target text is left unchanged; the state transition probability matrices of the individual target-text HMMs are then merged in the order of the individual target texts of the target text, obtaining the state transition probability matrix of the composite HMM.
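The block composition of the state transition matrix described above can be sketched as follows. This is an illustrative reading, assuming left-to-right HMMs with square per-model transition matrices:

```python
import numpy as np

def compose_transition_matrix(mats):
    """Merge per-model transition matrices into one block-diagonal matrix,
    forcing each model's last state to jump to the next model's first state
    (self-loop probability 0, forward probability 1). The last model's
    matrix is left unchanged, as the description requires."""
    n = sum(m.shape[0] for m in mats)
    A = np.zeros((n, n))
    offset = 0
    for i, m in enumerate(mats):
        k = m.shape[0]
        A[offset:offset + k, offset:offset + k] = m
        if i < len(mats) - 1:                 # not the final model
            last = offset + k - 1
            A[last, last] = 0.0               # no self-loop on the exit state
            A[last, last + 1] = 1.0           # jump into the next model
        offset += k
    return A
```

The Gaussian mixture parameters of the composite states are simply those of the corresponding states of the individual models, so only the transition matrix needs assembling.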
In an optional embodiment, the segmentation module applies forced segmentation to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts by: selecting the initial segmentation unit with the longest feature segment for forced splitting, so that after forced segmentation the total number of segmentation units equals the number of preset target texts.
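A sketch of this longest-first halving, with segmentation units represented as hypothetical (start_frame, end_frame) pairs:

```python
def forced_split(units, target_count):
    """Halve the longest segmentation unit repeatedly until there are
    target_count units in total; units are (start, end) frame ranges."""
    units = list(units)
    while len(units) < target_count:
        # pick the currently longest unit and split it evenly in two
        longest = max(units, key=lambda u: u[1] - u[0])
        units.remove(longest)
        start, end = longest
        mid = (start + end) // 2
        units += [(start, mid), (mid, end)]
    return sorted(units)        # restore time order
```

Each split adds exactly one unit, so the loop terminates after (target_count - len(units)) iterations.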
In an optional embodiment, the segmentation module applies forced segmentation to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts by: splitting the initial segmentation units in descending order of length, each time splitting one initial segmentation unit evenly into two segments, until the total number of segmentation units equals the number of target texts; if the number of forced splits is greater than or equal to the second threshold, the forced segmentation ends; if the number of forced splits is less than the second threshold, each current segmentation unit is scored against every target-text HMM, the target-text HMM with the highest score is selected for each unit, and the selected target-text HMMs are concatenated into a second composite HMM; the speech features are used as input to the second composite HMM for Viterbi decoding to obtain a second state output sequence, and the positions in this sequence corresponding to states whose index is an integer multiple of the number of states of a single target-text HMM are taken as segmentation points, the portions of the speech features delimited by these points being the segmentation units; if the current number of segmentation units is less than the third threshold, the current segmentation units are taken as the initial segmentation units and forced segmentation continues; if the current number of segmentation units is greater than or equal to the third threshold, forced segmentation ends and the resulting units are the final segmentation units.
In an optional embodiment, the text matching module matches the speech features of each segmentation unit against all the target-text models to obtain the segmentation-unit text matching score for each segmentation unit against each target-text model by: taking the speech features of each segmentation unit as input to each target-text hidden Markov model (HMM), and taking the output probability obtained according to the Viterbi algorithm as the corresponding segmentation-unit text matching score.
In an optional embodiment, the authentication module performs identity authentication according to the segmentation-unit text matching scores, the voiceprint matching score, and the pre-trained decision classifier by: taking, for each segmentation unit, the texts corresponding to the m highest of its segmentation-unit text matching scores as candidate texts; if the candidate texts include the target text corresponding to the segmentation unit, the segmentation unit passes authentication; counting the total number of segmentation units that pass; if the total number of passing units is less than or equal to the fourth threshold, text authentication fails, identity authentication fails, and the decision ends; if the total number of passing units is greater than the fourth threshold, text authentication of the input speech passes; then judging whether the voiceprint matching score is greater than the fifth threshold: if so, voiceprint authentication passes, identity authentication passes, and the decision ends; if not, performing score normalization on the text score of each segmentation unit against its corresponding target-text model and on the voiceprint matching score, and using the normalized scores as input to the PNN classifier for identity authentication.
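The decision flow of the authentication module can be sketched as follows. The threshold values, m, and the `pnn_decide` callback are assumptions for illustration, not values given by the disclosure:

```python
def authenticate(unit_scores, targets, vp_score, pnn_decide,
                 m=3, text_pass_thresh=6, vp_thresh=0.9):
    """Two-stage decision: m-best text authentication per segmentation unit,
    then a voiceprint threshold, falling back to the PNN classifier.
    unit_scores: per unit, a dict mapping candidate text -> matching score
    targets:     target text expected for each unit"""
    passed = 0
    for scores, target in zip(unit_scores, targets):
        # m-best candidate texts for this segmentation unit
        best = sorted(scores, key=scores.get, reverse=True)[:m]
        if target in best:
            passed += 1
    if passed <= text_pass_thresh:
        return False                      # text authentication failed
    if vp_score > vp_thresh:
        return True                       # voiceprint alone suffices
    # borderline voiceprint: defer to the classifier on normalized scores
    return pnn_decide(unit_scores, vp_score)
```

The fallback branch corresponds to the score normalization plus PNN decision of steps 109-110.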
An embodiment of the present invention further provides a computer-readable storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code to be executed by a processor, the program code performing the following steps:

S1: acquiring the speech features of an input speech, and matching the speech features against a pre-stored target voiceprint model to obtain a voiceprint matching score;

S2: segmenting the input speech according to the speech features and preset target-text models, and obtaining initial segmentation units and the number of initial segmentation units; if the number of initial segmentation units is less than a first threshold, judging the input speech to be illegitimate speech; if the number of initial segmentation units is greater than or equal to the first threshold, applying forced segmentation to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts;

S3: matching the speech features of each segmentation unit against all the target-text models to obtain a segmentation-unit text matching score for each segmentation unit against each target-text model;

S4: performing identity authentication according to the segmentation-unit text matching scores, the voiceprint matching score, and a pre-trained probabilistic neural network (PNN) classifier.
Optionally, in this embodiment, the storage medium may include, but is not limited to, media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Those of ordinary skill in the art will appreciate that all or some of the steps of the above methods may be completed by a program instructing the relevant hardware, the program being stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software function module. The present invention is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention. The present invention may of course have various other embodiments, and those familiar with the art may make various corresponding changes and variations according to the present invention without departing from its spirit and essence, all of which shall fall within the protection scope of the claims appended to the present invention.
Industrial applicability

The technical solution provided by the embodiments of the present invention can be applied in an identity authentication process, combining voiceprint authentication with dynamic password authentication to achieve double verification of the user, improving the security, reliability, and accuracy of the system.

Claims (18)

  1. A method of identity authentication, comprising:
    acquiring the speech features of an input speech, and matching the speech features against a pre-stored target voiceprint model to obtain a voiceprint matching score;
    segmenting the input speech according to the speech features and preset target-text models, and obtaining initial segmentation units and the number of initial segmentation units; if the number of initial segmentation units is less than a first threshold, judging the input speech to be illegitimate speech; if the number of initial segmentation units is greater than or equal to the first threshold, applying forced segmentation to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts;
    matching the speech features of each segmentation unit against all the target-text models to obtain a segmentation-unit text matching score for each segmentation unit against each target-text model;
    performing identity authentication according to the segmentation-unit text matching scores, the voiceprint matching score, and a pre-trained probabilistic neural network (PNN) classifier.
  2. The method according to claim 1, wherein the PNN classifier is trained in the following manner:
    matching target speech against the target-text models and the target voiceprint model to obtain a first text score and a first voiceprint score respectively, and combining the first text score and the first voiceprint score into acceptance feature information for the decision classifier;
    matching non-target speech against the target-text models and the target voiceprint model to obtain a second text score and a second voiceprint score respectively, and combining the second text score and the second voiceprint score into rejection feature information for the decision classifier;
    training the PNN classifier according to the acceptance feature information and the rejection feature information.
  3. The method according to claim 2, wherein before training the PNN classifier according to the acceptance feature information and the rejection feature information, the method further comprises score normalization of the voiceprint scores and text scores of the target speech and the non-target speech, comprising:
    selecting the target-text models in turn, matching the speech features of non-target texts against the corresponding target-text model to obtain impostor text scores, and obtaining the mean and standard deviation of the impostor text scores corresponding to each target-text model;
    subtracting the corresponding mean of the impostor text scores from the first text score and the second text score respectively and dividing by the standard deviation, to obtain normalized text scores;
    combining the normalized first text scores with the first voiceprint score, obtaining the maximum and minimum corresponding to each target text, and normalizing the normalized first text scores and the first voiceprint score with the maximum and minimum, as the acceptance feature information of the PNN classifier;
    combining the normalized second text scores with the second voiceprint score, obtaining the maximum and minimum corresponding to each target text, and normalizing the normalized second text scores and the second voiceprint score with the maximum and minimum, as the rejection feature information of the PNN classifier.
    根据目标密码中的目标文本序列,将对应的目标文本隐马尔可夫模型HMM组合成第一复合HMM;Correlating the corresponding target text hidden Markov model HMM into a first composite HMM according to the target text sequence in the target password;
    将所述语音特征作为所述第一复合HMM的输入进行维特比解码,得到第一状态输出序列,将所述第一状态输出序列中为单个目标文本HMM的状态数的整数倍的状态对应的位置作为初始切分点;Performing Viterbi decoding as the input of the first composite HMM to obtain a first state output sequence, and corresponding to a state in the first state output sequence that is an integer multiple of a state number of a single target text HMM Position as the initial segmentation point;
    依次选取所述相邻两个初始切分点作为区间起止点,在所述区间内,以指定帧为单位计算平均能量,寻找平均能量连续指定次增大的点,并将开始增大的点作为新的初始切分点,由所述初始切分点分割成的所述初始切分单元。The adjacent two initial segmentation points are sequentially selected as a range start and end point, in which the average energy is calculated in units of specified frames, and the point where the average energy continuously increases by a specified number of times is found, and the point at which the increase is started is started. As the new initial segmentation point, the initial segmentation unit is divided by the initial segmentation point.
  5. The method according to claim 4, wherein concatenating the corresponding target-text HMMs into the first composite HMM comprises:
    the number of states of the first composite HMM being the sum of the numbers of states of the individual target-text HMMs, and the Gaussian mixture model parameters of each state of the first composite HMM being identical to those of the corresponding state of the individual target-text HMM;
    setting the self-transition probability of the last state in the state transition matrix of each individual target-text HMM to 0 and its probability of transitioning to the next state to 1, the state transition probability matrix of the last individual target-text HMM of the target text being left unchanged;
    merging the state transition probability matrices of the individual target-text HMMs in the order of the individual target texts of the target text, to obtain the state transition probability matrix of the composite HMM.
  6. The method according to claim 1, wherein applying forced segmentation to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts comprises:
    selecting the initial segmentation unit with the longest feature segment for forced splitting, so that after forced segmentation the total number of segmentation units equals the number of preset target texts.
  7. The method according to claim 1, wherein applying forced segmentation to the initial segmentation units so that the total number of segmentation units equals the number of preset target texts comprises:
    forcibly splitting the initial segmentation units in descending order of length, each time splitting one initial segmentation unit evenly into two segments, until the total number of segmentation units after splitting equals the number of target texts;
    if the number of forced splits is greater than or equal to a second threshold, ending the forced segmentation; if the number of forced splits is less than the second threshold, scoring each current segmentation unit against each target-text hidden Markov model (HMM), selecting for each unit the target-text HMM with the highest score, and concatenating the selected target-text HMMs into a second composite HMM; performing Viterbi decoding with the speech features as input to the second composite HMM to obtain a second state output sequence, and taking as segmentation points the positions in the second state output sequence corresponding to states whose index is an integer multiple of the number of states of a single target-text HMM, the portions of the speech features delimited by these points being the segmentation units; if the current number of segmentation units is less than a third threshold, taking the current segmentation units as the initial segmentation units and continuing the forced segmentation; if the current number of segmentation units is greater than or equal to the third threshold, ending the forced segmentation.
  8. The method according to claim 1, wherein matching the speech features of each segmentation unit against all the target-text models to obtain the segmentation-unit text matching score for each segmentation unit against each target-text model comprises:
    taking the speech features of each segmentation unit as input to each target-text hidden Markov model (HMM), and taking the output probability obtained according to the Viterbi algorithm as the corresponding segmentation-unit text matching score.
  9. The method according to any one of claims 1-8, wherein performing identity authentication according to the segmentation unit text matching scores, the voiceprint matching score, and a pre-trained decision classifier comprises:
    taking the texts corresponding to the m highest scores among the segmentation unit text matching scores of each segmentation unit as candidate texts; if the candidate texts include the target text corresponding to the segmentation unit, the segmentation unit passes authentication; counting the total number of segmentation units that pass; if the total number of passing segmentation units is less than or equal to a fourth threshold, text authentication fails and identity authentication fails; if the total number of passing segmentation units is greater than the fourth threshold, text authentication of the input speech passes;
    determining whether the voiceprint matching score is greater than a fifth threshold; if so, voiceprint authentication passes and identity authentication passes; if not, performing score normalization on the text score of each segmentation unit against its corresponding target text model and on the voiceprint matching score, and using the normalized scores as input to the decision classifier for identity authentication.
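The two-stage decision in claim 9 reduces to plain control flow. A sketch under illustrative assumptions: the thresholds, m, and the score values are placeholders, and the trained decision classifier is abstracted as an optional callable rather than a specific PNN implementation:

```python
def authenticate(unit_scores, targets, voiceprint_score,
                 m=3, pass_min=4, vp_threshold=0.0, classifier=None):
    """unit_scores: one dict per segmentation unit mapping candidate
    text -> matching score; targets: expected target text per unit."""
    passed = 0
    for scores, target in zip(unit_scores, targets):
        top_m = sorted(scores, key=scores.get, reverse=True)[:m]
        if target in top_m:          # unit passes text authentication
            passed += 1
    if passed <= pass_min:           # fourth-threshold check: reject outright
        return False
    if voiceprint_score > vp_threshold:  # fifth-threshold check: accept directly
        return True
    # Borderline case: defer to the trained decision classifier.
    return classifier(unit_scores, voiceprint_score) if classifier else False

units = [{"one": 2.0, "two": 1.0, "six": 0.5},
         {"two": 3.0, "one": 0.2, "six": 0.1}] * 3   # 6 segmentation units
targets = ["one", "two"] * 3
result = authenticate(units, targets, voiceprint_score=1.5)
```

With all six units passing text authentication and a voiceprint score above the threshold, the call accepts without consulting the classifier; a low voiceprint score would instead hand the normalized scores to the classifier stage.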
  10. An identity authentication apparatus, comprising a probabilistic neural network (PNN) classifier, and further comprising:
    a voiceprint matching module, configured to acquire speech features of an input speech and match the speech features with a pre-stored target voiceprint model to obtain a voiceprint matching score;
    a segmentation module, configured to segment the input speech according to the speech features and preset target text models to obtain initial segmentation units and the number of initial segmentation units; if the number of initial segmentation units is less than a first threshold, determine that the input speech is illegal speech; if the number of initial segmentation units is greater than or equal to the first threshold, perform forced segmentation on the initial segmentation units so that the total number of segmentation units equals the number of preset target texts;
    a text matching module, configured to match the speech features of each segmentation unit with all the target text models to obtain a segmentation unit text matching score for each segmentation unit and each target text model;
    an authentication module, configured to perform identity authentication according to the segmentation unit text matching scores, the voiceprint matching score, and the pre-trained PNN classifier.
  11. The apparatus according to claim 10, further comprising a processing module, wherein:
    the voiceprint matching module is configured to match target speech with the target voiceprint model to obtain a first voiceprint score, and match non-target speech with the target voiceprint model to obtain a second voiceprint score;
    the text matching module is configured to match the target speech with the target text models to obtain a first text score, and match the non-target speech with the target text models to obtain a second text score;
    the processing module is configured to combine the first text score and the first voiceprint score into acceptance feature information for the PNN classifier, and combine the second text score and the second voiceprint score into rejection feature information for the PNN classifier;
    the PNN classifier is trained according to the acceptance feature information and the rejection feature information.
  12. The apparatus according to claim 11, wherein
    the processing module is further configured to: select the target text models in turn, match the speech features of non-target texts with the corresponding target text model to obtain impostor text scores, and obtain the mean and standard deviation of the impostor text scores for each target text model; subtract the corresponding impostor-score mean from the first text score and the second text score respectively and divide by the standard deviation to obtain normalized text scores; combine the normalized first text scores with the first voiceprint score and obtain the maximum and minimum values corresponding to each target text; normalize the combined normalized first text scores and first voiceprint score using these maximum and minimum values, as the acceptance feature information of the PNN classifier; combine the normalized second text scores with the second voiceprint score, obtain the maximum and minimum values corresponding to each target text, and normalize the combined normalized second text scores and second voiceprint score using these maximum and minimum values, as the rejection feature information of the PNN classifier.
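The normalization chain in claim 12 — a z-norm against impostor-score statistics followed by per-dimension min-max scaling over the combined feature vectors — can be sketched with NumPy. All numbers below are synthetic stand-ins for real genuine/impostor trial scores:

```python
import numpy as np

def z_norm(scores, impostor_mean, impostor_std):
    # Subtract the impostor-score mean and divide by its standard deviation.
    return (scores - impostor_mean) / impostor_std

def min_max(feat, lo, hi):
    # Scale each dimension into [0, 1] using the observed extrema.
    return (feat - lo) / (hi - lo)

# Impostor text scores per target text (one row per target text model).
impostor_scores = np.array([[-8.0, -7.0, -9.0],
                            [-6.0, -5.0, -7.0]])
mu = impostor_scores.mean(axis=1)
sd = impostor_scores.std(axis=1)

first_text = z_norm(np.array([-2.0, -1.0]), mu, sd)   # genuine attempt
second_text = z_norm(np.array([-9.0, -8.0]), mu, sd)  # impostor attempt
first_vp, second_vp = 1.2, -0.5                       # voiceprint scores

accept = np.append(first_text, first_vp)   # text scores + voiceprint score
reject = np.append(second_text, second_vp)
lo = np.minimum(accept, reject)
hi = np.maximum(accept, reject)
accept_feat = min_max(accept, lo, hi)   # acceptance feature information
reject_feat = min_max(reject, lo, hi)   # rejection feature information
```

The z-norm step compensates for target texts that are intrinsically easier or harder to impersonate, and the min-max step keeps every classifier input dimension on a comparable scale.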
  13. The apparatus according to claim 10, wherein
    the segmentation module segments the input speech according to the speech features and the preset target text models to obtain the initial segmentation units by: combining the corresponding target text hidden Markov models (HMMs) into a first composite HMM according to the target text sequence in a target password; using the speech features as input to the first composite HMM for Viterbi decoding to obtain a first state output sequence, and taking the positions in the first state output sequence corresponding to states whose index is an integer multiple of the number of states of a single target text HMM as initial segmentation points; taking each pair of adjacent initial segmentation points in turn as the start and end of an interval, computing the average energy within the interval in units of a specified number of frames, finding points where the average energy increases for a specified number of consecutive times, and taking the point where the increase begins as a new initial segmentation point; the initial segmentation units being the units divided by the initial segmentation points.
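The energy-refinement step at the end of claim 13 (scanning between adjacent segmentation points for the onset of a sustained rise in average energy) might look like the following; the frame-group size and the run length required are illustrative parameters, not values specified by the patent:

```python
def refine_split_point(frame_energy, group=2, rises=3):
    """Return the frame index where a run of `rises` consecutive
    increases in group-averaged energy begins, or None if absent."""
    avg = [sum(frame_energy[i:i + group]) / group
           for i in range(0, len(frame_energy) - group + 1, group)]
    run = 0
    for i in range(1, len(avg)):
        run = run + 1 if avg[i] > avg[i - 1] else 0
        if run >= rises:
            # (i - rises + 1) is the first group of the rising run;
            # convert that group index back to a frame index.
            return (i - rises + 1) * group
    return None

# Quiet stretch followed by a steady energy ramp (a speech onset).
energy = [0.1, 0.1, 0.1, 0.1, 0.5, 0.6, 1.0, 1.2, 2.0, 2.5, 3.0, 3.5]
onset = refine_split_point(energy)
```

A flat-energy interval yields no refined point, so the original Viterbi-derived segmentation point would be kept in that case.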
  14. The apparatus according to claim 13, wherein
    the segmentation module combines the corresponding target text HMMs into the first composite HMM as follows: the number of states of the first composite HMM is the sum of the numbers of states of the single target text HMMs; each state of the first composite HMM has the same Gaussian mixture model parameters as the corresponding state of the single target text HMM; in the state transition matrix of each single target text HMM, the self-transition probability of the last state is set to 0 and the transition probability to the next state is set to 1; the state transition probability matrix of the last single target text HMM of the target text is left unchanged; the state transition probability matrices of the single target text HMMs are merged in the order in which the single target texts appear in the target text, to obtain the state transition probability matrix of the composite HMM.
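The matrix construction in claim 14 can be sketched directly with NumPy: place each per-text transition matrix on the diagonal of the composite matrix, then rewire the final state of every model except the last to jump deterministically into the next model. The 2-state matrices below are illustrative:

```python
import numpy as np

def compose_transition(matrices):
    """Concatenate left-to-right HMM transition matrices into one
    composite matrix; every model except the last has its final state
    rewired to enter the next model with probability 1."""
    sizes = [m.shape[0] for m in matrices]
    A = np.zeros((sum(sizes), sum(sizes)))
    offset = 0
    for idx, m in enumerate(matrices):
        n = m.shape[0]
        A[offset:offset + n, offset:offset + n] = m
        if idx < len(matrices) - 1:
            last = offset + n - 1
            A[last, last] = 0.0        # no self-loop on the final state
            A[last, last + 1] = 1.0    # deterministic jump to next model
        offset += n
    return A

m1 = np.array([[0.6, 0.4], [0.0, 1.0]])  # target text 1 (2 states)
m2 = np.array([[0.7, 0.3], [0.0, 1.0]])  # target text 2 (2 states)
A = compose_transition([m1, m2])
```

Because the last model's matrix is left unchanged, the composite chain can absorb in its final state, matching the claim's construction.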
  15. The apparatus according to claim 10, wherein
    the segmentation module performs forced segmentation on the initial segmentation units so that the total number of segmentation units equals the number of preset target texts by: selecting the initial segmentation unit with the longest feature segment for forced segmentation, so that the total number of segmentation units after forced segmentation equals the number of preset target texts.
  16. The apparatus according to claim 10, wherein
    the segmentation module performs forced segmentation on the initial segmentation units so that the total number of segmentation units equals the number of preset target texts by: splitting the initial segmentation units in descending order of length, each time dividing one initial segmentation unit evenly into two segments, until the total number of units after splitting equals the number of target texts; if the number of forced segmentations is greater than or equal to a second threshold, ending the forced segmentation; if the number of forced segmentations is less than the second threshold, matching and scoring each current segmentation unit against each target text hidden Markov model (HMM), selecting for each unit the target text HMM with the highest score, and combining the selected target text HMMs into a second composite HMM; using the speech features as input to the second composite HMM for Viterbi decoding to obtain a second state output sequence, taking the positions in the second state output sequence corresponding to states whose index is an integer multiple of the number of states of a single target text HMM as segmentation points, the units obtained by dividing the speech features at these segmentation points being the segmentation units; if the current number of segmentation units is less than a third threshold, taking the current segmentation units as the initial segmentation units and continuing the forced segmentation; if the current number of segmentation units is greater than or equal to the third threshold, ending the forced segmentation.
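The length-ordered splitting in claim 16 — repeatedly halve the longest unit until the unit count matches the target text count — reduces to a simple loop. In this sketch each unit is represented as a `(start, end)` frame range, an assumption made for illustration:

```python
def force_split(units, target_count):
    """Split (start, end) segments, longest first, halving one segment
    per step, until the number of segments reaches target_count."""
    units = sorted(units)
    while len(units) < target_count:
        longest = max(units, key=lambda u: u[1] - u[0])
        start, end = longest
        mid = (start + end) // 2           # divide evenly into two segments
        units.remove(longest)
        units.extend([(start, mid), (mid, end)])
        units.sort()
    return units

# Two initial units forced into four, matching four target texts.
units = force_split([(0, 100), (100, 140)], target_count=4)
```

In the full procedure of claim 16 this loop would be followed by the HMM rescoring and second-composite-HMM decoding pass, bounded by the second and third thresholds.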
  17. The apparatus according to claim 10, wherein
    the text matching module matches the speech features of each segmentation unit with all the target text models to obtain a segmentation unit text matching score for each segmentation unit and each target text model by: using the speech features of each segmentation unit as input to each target text hidden Markov model (HMM), and taking the output probability obtained by the Viterbi algorithm as the corresponding segmentation unit text matching score.
  18. The apparatus according to any one of claims 10-17, wherein
    the authentication module performs identity authentication according to the segmentation unit text matching scores, the voiceprint matching score, and the pre-trained decision classifier by: taking the texts corresponding to the m highest scores among the segmentation unit text matching scores of each segmentation unit as candidate texts; if the candidate texts include the target text corresponding to the segmentation unit, the segmentation unit passes authentication; counting the total number of segmentation units that pass; if the total number of passing segmentation units is less than or equal to a fourth threshold, text authentication fails and identity authentication fails; if the total number of passing segmentation units is greater than the fourth threshold, text authentication of the input speech passes; determining whether the voiceprint matching score is greater than a fifth threshold; if so, voiceprint authentication passes and identity authentication passes; if not, performing score normalization on the text score of each segmentation unit against its corresponding target text model and on the voiceprint matching score, and using the normalized scores as input to the PNN classifier for identity authentication.
PCT/CN2017/076336 2016-03-21 2017-03-10 Identity authentication method and device WO2017162053A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610162027.X 2016-03-21
CN201610162027.XA CN107221333B (en) 2016-03-21 2016-03-21 Identity authentication method and device

Publications (1)

Publication Number Publication Date
WO2017162053A1 (en) 2017-09-28

Family

ID=59899353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/076336 WO2017162053A1 (en) 2016-03-21 2017-03-10 Identity authentication method and device

Country Status (2)

Country Link
CN (1) CN107221333B (en)
WO (1) WO2017162053A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154588B (en) * 2017-12-29 2020-11-27 深圳市艾特智能科技有限公司 Unlocking method and system, readable storage medium and intelligent device
CN108831484A (en) * 2018-05-29 2018-11-16 广东声将军科技有限公司 A kind of offline and unrelated with category of language method for recognizing sound-groove and device
CN109545226B (en) * 2019-01-04 2022-11-22 平安科技(深圳)有限公司 Voice recognition method, device and computer readable storage medium
WO2020206455A1 (en) * 2019-04-05 2020-10-08 Google Llc Joint automatic speech recognition and speaker diarization
CN110502610A (en) * 2019-07-24 2019-11-26 深圳壹账通智能科技有限公司 Intelligent sound endorsement method, device and medium based on text semantic similarity
CN111862967A (en) * 2020-04-07 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111882543B (en) * 2020-07-29 2023-12-26 南通大学 Cigarette filter stick counting method based on AA R2Unet and HMM

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6671672B1 (en) * 1999-03-30 2003-12-30 Nuance Communications Voice authentication system having cognitive recall mechanism for password verification
CN102413101A (en) * 2010-09-25 2012-04-11 盛乐信息技术(上海)有限公司 Voice-print authentication system having voice-print password voice prompting function and realization method thereof
CN102457845A (en) * 2010-10-14 2012-05-16 阿里巴巴集团控股有限公司 Method, equipment and system for authenticating identity by wireless service
CN103220286A (en) * 2013-04-10 2013-07-24 郑方 Identity verification system and identity verification method based on dynamic password voice
CN104021790A (en) * 2013-02-28 2014-09-03 联想(北京)有限公司 Sound control unlocking method and electronic device
CN104064189A (en) * 2014-06-26 2014-09-24 厦门天聪智能软件有限公司 Vocal print dynamic password modeling and verification method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060294390A1 (en) * 2005-06-23 2006-12-28 International Business Machines Corporation Method and apparatus for sequential authentication using one or more error rates characterizing each security challenge
CN102543084A (en) * 2010-12-29 2012-07-04 盛乐信息技术(上海)有限公司 Online voiceprint recognition system and implementation method thereof


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019194787A1 (en) * 2018-04-02 2019-10-10 Visa International Service Association Real-time entity anomaly detection
CN111131237A (en) * 2019-12-23 2020-05-08 深圳供电局有限公司 Microgrid attack identification method based on BP neural network and grid-connected interface device
CN111131237B (en) * 2019-12-23 2020-12-29 深圳供电局有限公司 Microgrid attack identification method based on BP neural network and grid-connected interface device
CN111862933A (en) * 2020-07-20 2020-10-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating synthesized speech
CN112423063A (en) * 2020-11-03 2021-02-26 深圳Tcl新技术有限公司 Automatic setting method and device for smart television and storage medium
CN112751838A (en) * 2020-12-25 2021-05-04 中国人民解放军陆军装甲兵学院 Identity authentication method, device and system

Also Published As

Publication number Publication date
CN107221333B (en) 2019-11-08
CN107221333A (en) 2017-09-29


Legal Events

Date Code Title Description
NENP Non-entry into the national phase — Ref country code: DE
121 Ep: the EPO has been informed by WIPO that EP was designated in this application — Ref document number: 17769330; Country of ref document: EP; Kind code of ref document: A1
122 Ep: PCT application non-entry in European phase — Ref document number: 17769330; Country of ref document: EP; Kind code of ref document: A1