WO2019179029A1 - Electronic device, identity verification method and computer-readable storage medium - Google Patents

Electronic device, identity verification method and computer-readable storage medium Download PDF

Info

Publication number
WO2019179029A1
WO2019179029A1 (PCT/CN2018/102105)
Authority
WO
WIPO (PCT)
Prior art keywords
layer
voice data
preset
neural network
average
Prior art date
Application number
PCT/CN2018/102105
Other languages
French (fr)
Chinese (zh)
Inventor
赵峰
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019179029A1 publication Critical patent/WO2019179029A1/en

Classifications

    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/18 Artificial neural networks; connectionist approaches
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique, using neural networks
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of voiceprint recognition technologies, and in particular, to an electronic device, an authentication method, and a computer readable storage medium.
  • Speaker recognition, commonly referred to as voiceprint recognition, is a type of biometric technology often used to confirm whether a given segment of speech was spoken by a designated person; it is a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example finance, securities, social security, public security, the military, and other civil security certification.
  • Speaker recognition includes text-dependent recognition and text-independent recognition.
  • In recent years, text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past.
  • However, in some constrained cases, for example when the collected effective speech of the speaker is short (less than 5 seconds), existing text-independent speaker recognition technology is not accurate and is error-prone.
  • the main purpose of the present application is to provide an electronic device, an authentication method, and a computer readable storage medium, which are intended to improve the accuracy of speaker authentication.
  • An electronic device proposed by the present application includes a memory and a processor, where the memory stores an identity verification system executable on the processor, and the identity verification system, when executed by the processor, implements the following steps:
  • after the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from a database, and the current voice data and the standard voice data are each framed according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • the processor is further configured to execute the identity verification system to implement the following steps:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • the training process of the preset structure deep neural network model is:
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the process of the iterative training of the preset structure deep neural network model comprises:
  • the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
  • the model parameters are updated according to the calculated cosine similarities and a predetermined loss function L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2 and N is the number of triplets obtained.
  • the network structure of the preset structure deep neural network model is as follows:
  • the first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • The present application also provides an identity verification method, which includes:
  • after the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from a database, and the current voice data and the standard voice data are each framed according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • Preferably, before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further includes the step of:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • the training process of the preset structure deep neural network model is:
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the network structure of the preset structure deep neural network model is as follows:
  • the first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • the application further provides a computer readable storage medium storing an identity verification system executable by at least one processor to cause the at least one processor to perform the following steps:
  • after the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from a database, and the current voice data and the standard voice data are each framed according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • In the technical solution of the present application, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset type acoustic features of each voice frame obtained by the framing are extracted using a preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated; and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result. The scheme therefore makes speaker identity verification more accurate and reliable.
  • FIG. 1 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
  • FIG. 2 is a schematic flowchart of the training process of the preset structure deep neural network model according to the present application.
  • FIG. 3 is a schematic diagram of an operating environment of an embodiment of an identity verification system according to the present application.
  • FIG. 4 is a block diagram of a program of an embodiment of an identity verification system of the present application.
  • FIG. 5 is a block diagram of the program of a second embodiment of the identity verification system of the present application.
  • Referring to FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the identity verification method according to the present application.
  • the identity verification method includes:
  • Step S10: after receiving the current voice data of the target user to be authenticated, obtain the standard voice data corresponding to the identity to be verified from the database, and frame the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains, according to the identity claimed by the target user (the identity to be verified), the standard voice data corresponding to that identity from the database, and then frames the received current voice data and the obtained standard voice data respectively according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising a plurality of voice frames obtained by dividing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising a plurality of voice frames obtained by dividing the standard voice data).
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds.
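  • As an illustrative sketch only (the patent does not prescribe an implementation), framing with a 25 ms window and a 10 ms shift could look like the following Python snippet; the 16 kHz sample rate and the function name are assumptions:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])
```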
  • Step S20: extract, using a preset filter, the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the two groups using the preset filter, to extract the preset type acoustic features of each voice frame in the current voice frame group and in the standard voice frame group.
  • In this embodiment, the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) feature.
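  • For illustration only, since the patent names the Mel filter and 36-dimensional MFCC features but no particular library, the extraction might be sketched with librosa as follows; the file path, sample rate, and window parameters are assumptions:

```python
import librosa

def extract_mfcc(wav_path, sample_rate=16000):
    """Extract 36-dimensional MFCC features with a 25 ms window and a 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sample_rate)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=36,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr))  # 10 ms frame shift
    return mfcc.T                    # shape: (n_frames, 36)
```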
  • Step S30: input the preset type acoustic features corresponding to the extracted current voice frame group and the preset type acoustic features corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model respectively, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data;
  • Step S40: calculate the cosine similarity of the two feature vectors, and determine the identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.
  • The identity verification system holds a pre-trained preset structure deep neural network model, obtained by iterative training on the preset type acoustic features of sample speech data. After extracting the preset type acoustic features corresponding to the current speech frame group and the standard speech frame group, the identity verification system inputs them into the pre-trained model, which converts the preset type acoustic features corresponding to the current speech frame group and those corresponding to the standard speech frame group into feature vectors of a preset length (for example, feature vectors of length 1). The cosine similarity of the two feature vectors is then calculated and compared with a preset threshold (for example, 0.95) to determine the identity verification result: if the cosine similarity reaches the threshold, the verification passes; otherwise, the verification fails.
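  • A minimal sketch of this decision step (the threshold value is illustrative; because the model's normalization layer outputs unit-length vectors, the cosine similarity reduces to a dot product, but the general formula is shown for clarity):

```python
import numpy as np

def verify(current_vec, standard_vec, threshold=0.95):
    """Return True (verification pass) if the cosine similarity exceeds the threshold."""
    cos_sim = np.dot(current_vec, standard_vec) / (
        np.linalg.norm(current_vec) * np.linalg.norm(standard_vec))
    return cos_sim >= threshold
```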
  • The current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are thus first framed; the preset type acoustic features of each voice frame obtained by the framing are extracted using the preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated; and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result. The scheme therefore makes speaker identity verification more accurate and reliable.
  • the method further includes the following steps:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • The non-speaker voice parts (for example, silence or noise) are removed by voice activity detection (VAD).
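  • The patent does not fix a particular VAD algorithm; a simple energy-based sketch operating on the frames produced above (the floor of -35 dB relative to the loudest frame is purely illustrative) might look like:

```python
import numpy as np

def energy_vad(frames, energy_floor_db=-35.0):
    """Keep only frames whose log energy exceeds a floor relative to the loudest frame."""
    energies = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    mask = energies > (energies.max() + energy_floor_db)
    return frames[mask]
```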
  • the training process of the preset structure deep neural network model is:
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • In this embodiment, each voice data sample is voice data of a known speaker identity; among the voice data samples, each speaker identity, or part of the speaker identities, corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of the corresponding speaker.
  • The non-speaker voices (for example, silence or noise) in each voice data sample are deleted by active endpoint detection to obtain the standard voice data samples.
  • A first percentage of the obtained standard voice data samples is used as the training set and a second percentage as the verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%; for example, 70% of the standard voice data samples are used as the training set and 30% as the verification set.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter, and the preset type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstral Coefficient) features, for example 36-dimensional MFCC features.
  • the verification set is used to verify the accuracy of the preset structure deep neural network model
  • The preset type acoustic features in the training set are processed in batches, divided into M (for example, 30) batches. The batches can be allocated by voice frame group, with each batch containing an equal or unequal number of voice frame groups and their corresponding preset type acoustic features. The preset type acoustic features corresponding to each speech frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training: each batch of preset type acoustic features causes the model to iterate once, and each iteration updates the model parameters. After multiple training iterations, the preset structure deep neural network model has been updated to better model parameters.
  • The accuracy of the preset structure deep neural network model is then verified using the verification set: the standard voice data samples in the verification set are paired into groups of two, and the samples of each group are input into the model in turn to obtain a verification result for that group.
  • The accuracy rate is calculated according to the number of correct verification results; for example, if 100 groups are verified and 99 groups yield correct verification results, the accuracy rate is 99%.
  • A verification threshold for the accuracy rate (that is, the preset threshold, for example 98.5%) is preset in the system for verifying the training effect of the preset structure deep neural network model.
  • If the verified accuracy of the preset structure deep neural network model is greater than the preset threshold, the training of the model has reached the standard and the model training ends.
  • Otherwise, the number of acquired voice data samples is increased, and steps S1-S5 are re-executed in a loop until the requirement of step S6 is met and the model training ends.
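  • A sketch of this pairwise verification-accuracy check, assuming a hypothetical `embed` function that maps voice data to a unit-length feature vector and a list of labeled sample pairs:

```python
import numpy as np

def verification_accuracy(pairs, labels, embed, threshold=0.95):
    """pairs: list of (voice_a, voice_b); labels: True if both come from the same speaker."""
    correct = 0
    for (a, b), same_speaker in zip(pairs, labels):
        predicted_same = float(np.dot(embed(a), embed(b))) >= threshold
        correct += (predicted_same == same_speaker)
    return correct / len(pairs)  # e.g. 99 correct out of 100 groups -> 0.99
```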
  • the process of the iterative training of the preset structure deep neural network model includes:
  • the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
  • the model parameters are updated according to the calculated cosine similarities and a predetermined loss function L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2 and N is the number of triplets obtained.
  • The model parameter updating step is: 1. use the back propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
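  • A hedged PyTorch sketch of one such update, using the triplet cosine-margin loss above and plain SGD; the `SpeakerEmbedder` module is the sketch given after the network structure description below, and the margin and learning rate are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, anchor, positive, negative, alpha=0.1):
    """One mini-batch update: backpropagate the triplet cosine-margin loss."""
    optimizer.zero_grad()
    e_a, e_p, e_n = model(anchor), model(positive), model(negative)
    cos_pos = F.cosine_similarity(e_a, e_p)  # same-speaker similarity cos(x_i1, x_i2)
    cos_neg = F.cosine_similarity(e_a, e_n)  # different-speaker similarity cos(x_i1, x_i3)
    loss = torch.clamp(cos_neg - cos_pos + alpha, min=0).sum()
    loss.backward()    # back propagation computes the gradients
    optimizer.step()   # mini-batch SGD updates the parameters
    return loss.item()

# Example: optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed rate
```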
  • the network structure of the preset structure deep neural network model of this embodiment is as follows:
  • The first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
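  • For concreteness, a minimal PyTorch sketch of this five-layer structure; the layer sizes are assumptions, and the loss layer corresponds to the `train_step` triplet loss above, applied only during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbedder(nn.Module):
    """Stacked BiLSTM -> average over time -> DNN fully connected -> L2 normalization."""
    def __init__(self, n_mfcc=36, hidden=256, num_layers=2, embed_dim=128):
        super().__init__()
        # Layer 1: stacked layers, each with a forward and a backward LSTM in parallel
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Layer 3: the DNN fully connected layer
        self.fc = nn.Linear(2 * hidden, embed_dim)

    def forward(self, x):      # x: (batch, n_frames, n_mfcc)
        out, _ = self.lstm(x)  # (batch, n_frames, 2 * hidden)
        # Layer 2: averaging the concatenated forward/backward outputs over time
        # equals concatenating the forward mean and the backward mean
        avg = out.mean(dim=1)  # (batch, 2 * hidden)
        emb = self.fc(avg)
        # Layer 4: L2 normalization to a unit-length feature vector
        return F.normalize(emb, p=2, dim=1)
```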
  • the present application also proposes an identity verification system.
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 is installed in and runs on the electronic device 1.
  • The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a server.
  • the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • FIG. 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk or memory of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 is used to store application software and various types of data installed in the electronic device 1, such as program codes of the authentication system 10, and the like.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip for running program code or processing data stored in the memory 11, for example executing the identity verification system 10.
  • the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like in some embodiments.
  • The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface, such as a business customization interface.
  • the components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • FIG. 4 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 may be partitioned into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
  • For example, the identity verification system 10 may be partitioned into a framing module 101, an extraction module 102, a calculation module 103, and a result determination module 104.
  • A module referred to in this application is a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program for describing the execution process of the identity verification system 10 in the electronic device 1, where:
  • The framing module 101 is configured to: after receiving the current voice data of the target user to be authenticated, obtain the standard voice data corresponding to the identity to be verified from the database, and frame the current voice data and the standard voice data respectively according to the preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains, according to the identity claimed by the target user (the identity to be verified), the standard voice data corresponding to that identity from the database, and then frames the received current voice data and the obtained standard voice data respectively according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising a plurality of voice frames obtained by dividing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising a plurality of voice frames obtained by dividing the standard voice data).
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds.
  • the extracting module 102 is configured to separately extract, by using a preset filter, a preset type acoustic feature of each voice frame in the current voice frame group and a preset type acoustic feature of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the two groups using the preset filter, to extract the preset type acoustic features of each voice frame in the current voice frame group and in the standard voice frame group.
  • In this embodiment, the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) feature.
  • the calculating module 103 is configured to input the preset type acoustic features corresponding to the extracted current voice frame group and the preset type acoustic features corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model, respectively, to obtain the a feature vector of a preset length corresponding to each of the current voice data and the standard voice data;
  • the result determining module 104 is configured to calculate a cosine similarity of the obtained two feature vectors, and determine an identity verification result according to the calculated cosine similarity size, where the identity verification result includes a verification pass result and a verification failure result.
  • The identity verification system holds a pre-trained preset structure deep neural network model, obtained by iterative training on the preset type acoustic features of sample speech data. After extracting the preset type acoustic features corresponding to the current speech frame group and the standard speech frame group, the identity verification system inputs them into the pre-trained model, which converts the preset type acoustic features corresponding to the current speech frame group and those corresponding to the standard speech frame group into feature vectors of a preset length (for example, feature vectors of length 1). The cosine similarity of the two feature vectors is then calculated and compared with a preset threshold (for example, 0.95) to determine the identity verification result: if the cosine similarity reaches the threshold, the verification passes; otherwise, the verification fails.
  • The current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are thus first framed; the preset type acoustic features of each voice frame obtained by the framing are extracted using the preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated; and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result. The scheme therefore makes speaker identity verification more accurate and reliable.
  • FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
  • the identity verification system further includes:
  • The detecting module 105 is configured to perform active endpoint detection on the current voice data and the standard voice data before they are framed according to the preset framing parameters, and to delete the non-speaker speech in the current voice data and the standard voice data.
  • The non-speaker voice parts (for example, silence or noise) are removed by voice activity detection (VAD).
  • the training process of the preset structure deep neural network model is (refer to FIG. 2):
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • In this embodiment, each voice data sample is voice data of a known speaker identity; among the voice data samples, each speaker identity, or part of the speaker identities, corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of the corresponding speaker.
  • The non-speaker voices (for example, silence or noise) in each voice data sample are deleted by active endpoint detection to obtain the standard voice data samples.
  • A first percentage of the obtained standard voice data samples is used as the training set and a second percentage as the verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%; for example, 70% of the standard voice data samples are used as the training set and 30% as the verification set.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter, and the preset type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstral Coefficient) features, for example 36-dimensional MFCC features.
  • The preset type acoustic features in the training set are processed in batches, divided into M (for example, 30) batches. The batches can be allocated by voice frame group, with each batch containing an equal or unequal number of voice frame groups and their corresponding preset type acoustic features. The preset type acoustic features corresponding to each speech frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training: each batch of preset type acoustic features causes the model to iterate once, and each iteration updates the model parameters. After multiple training iterations, the preset structure deep neural network model has been updated to better model parameters.
  • The accuracy of the preset structure deep neural network model is then verified using the verification set: the standard voice data samples in the verification set are paired into groups of two, and the samples of each group are input into the model in turn to obtain a verification result for that group.
  • The accuracy rate is calculated according to the number of correct verification results; for example, if 100 groups are verified and 99 groups yield correct verification results, the accuracy rate is 99%.
  • A verification threshold for the accuracy rate (that is, the preset threshold, for example 98.5%) is preset in the system for verifying the training effect of the preset structure deep neural network model.
  • If the verified accuracy of the preset structure deep neural network model is greater than the preset threshold, the training of the model has reached the standard and the model training ends.
  • Otherwise, the number of acquired voice data samples is increased, and steps S1-S5 are re-executed in a loop until the requirement of step S6 is met and the model training ends.
  • the process of the iterative training of the preset structure deep neural network model includes:
  • the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
  • the model parameters are updated according to the calculated cosine similarities and a predetermined loss function L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2 and N is the number of triplets obtained.
  • The model parameter updating step is: 1. use the back propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
  • the network structure of the preset structure deep neural network model of this embodiment is as follows:
  • The first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • the present application further provides a computer readable storage medium storing an identity verification system executable by at least one processor to cause the at least one processor to execute The authentication method in any of the above embodiments.

Abstract

An electronic device, an identity verification method and a storage medium, wherein the method comprises: when current voice data of a target user whose identity is to be verified is received, acquiring standard voice data corresponding to the identity to be verified, and framing the two pieces of voice data to obtain a current voice frame group and a standard voice frame group (S10); using a preset filter to extract a preset type of acoustic feature from each voice frame of the two voice frame groups (S20); inputting the extracted acoustic features into a pre-trained deep neural network model of a preset structure to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data respectively (S30); and calculating the cosine similarity between the two feature vectors and determining the identity verification result according to the magnitude of the calculated cosine similarity (S40). The accuracy of speaker identity verification can thereby be improved.

Description

Electronic device, identity verification method and computer readable storage medium

Based on the Paris Convention, the present application claims priority to the Chinese patent application No. CN 2018102258872, entitled "Electronic Device, Identity Verification Method and Computer-Readable Storage Medium", filed on March 19, 2018, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the field of voiceprint recognition technologies, and in particular to an electronic device, an identity verification method, and a computer readable storage medium.

Background

Speaker recognition, commonly referred to as voiceprint recognition, is a type of biometric technology often used to confirm whether a given segment of speech was spoken by a designated person; it is a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example finance, securities, social security, public security, the military, and other civil security certification.

Speaker recognition includes text-dependent recognition and text-independent recognition. In recent years, text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past. However, in some constrained cases, for example when the collected effective speech of the speaker is short (less than 5 seconds), existing text-independent speaker recognition technology is not accurate and is error-prone.

Summary of the invention

The main purpose of the present application is to provide an electronic device, an identity verification method, and a computer readable storage medium, intended to improve the accuracy of speaker identity verification.
To achieve the above objective, the electronic device proposed by the present application includes a memory and a processor, where the memory stores an identity verification system executable on the processor, and the identity verification system, when executed by the processor, implements the following steps:

after receiving the current voice data of the target user whose identity is to be verified, obtaining the standard voice data corresponding to the identity to be verified from a database, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;

extracting, using a preset filter, the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;

inputting the preset type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data respectively;

calculating the cosine similarity of the two obtained feature vectors, and determining the identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.

Preferably, before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the processor is further configured to execute the identity verification system to implement the following step:

performing active endpoint detection on the current voice data and the standard voice data respectively, and deleting the non-speaker speech in the current voice data and the standard voice data.
Preferably, the training process of the preset structure deep neural network model is:

S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;

S2: perform active endpoint detection on each voice data sample, and delete the non-speaker speech in the voice data samples to obtain a preset number of standard voice data samples;

S3: use a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;

S4: frame each standard voice data sample in the training set and the verification set according to the preset framing parameters to obtain the voice frame group corresponding to each standard voice data sample, and then extract, using the preset filter, the preset type acoustic features of each voice frame in each voice frame group;

S5: divide the preset type acoustic features corresponding to the voice frame groups in the training set into M batches, input them batch by batch into the preset structure deep neural network model for iterative training, and, after the training of the preset structure deep neural network model is completed, verify the accuracy of the model using the verification set;

S6: if the verified accuracy is greater than a preset threshold, the model training ends;

S7: if the verified accuracy is less than or equal to the preset threshold, increase the number of acquired voice data samples, and re-execute the above steps S1-S5 based on the increased voice data samples.
Preferably, the iterative training process of the preset structure deep neural network model includes:

converting, according to the current parameters of the model, the preset type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of a preset length;

randomly selecting from the feature vectors to obtain a plurality of triplets, where the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;

calculating, with a predetermined formula, the cosine similarity \cos(x_{i1}, x_{i2}) between x_{i1} and x_{i2} and the cosine similarity \cos(x_{i1}, x_{i3}) between x_{i1} and x_{i3};

updating the parameters of the model according to the cosine similarities and a predetermined loss function L, the formula of which is

L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big)

where \alpha is a constant ranging from 0.05 to 0.2, and N is the number of triplets obtained.
Preferably, the network structure of the preset structure deep neural network model is as follows:

The first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3;

The second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;

The third layer is the deep neural network (DNN) fully connected layer;

The fourth layer is the normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;

The fifth layer is the loss layer; the formula of the loss function L is

L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big)

where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
The present application also proposes an identity verification method, which includes:

after receiving the current voice data of the target user whose identity is to be verified, obtaining the standard voice data corresponding to the identity to be verified from a database, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;

extracting, using a preset filter, the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;

inputting the preset type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data respectively;

calculating the cosine similarity of the two obtained feature vectors, and determining the identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.

Preferably, before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further includes the step of:

performing active endpoint detection on the current voice data and the standard voice data respectively, and deleting the non-speaker speech in the current voice data and the standard voice data.
Preferably, the training process of the preset structure deep neural network model is:

S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;

S2: perform active endpoint detection on each voice data sample, and delete the non-speaker speech in the voice data samples to obtain a preset number of standard voice data samples;

S3: use a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;

S4: frame each standard voice data sample in the training set and the verification set according to the preset framing parameters to obtain the voice frame group corresponding to each standard voice data sample, and then extract, using the preset filter, the preset type acoustic features of each voice frame in each voice frame group;

S5: divide the preset type acoustic features corresponding to the voice frame groups in the training set into M batches, input them batch by batch into the preset structure deep neural network model for iterative training, and, after the training of the preset structure deep neural network model is completed, verify the accuracy of the model using the verification set;

S6: if the verified accuracy is greater than a preset threshold, the model training ends;

S7: if the verified accuracy is less than or equal to the preset threshold, increase the number of acquired voice data samples, and re-execute the above steps S1-S5 based on the increased voice data samples.
Preferably, the network structure of the preset-structure deep neural network model is as follows:
First layer: several stacked neural network layers of identical structure, where each layer consists of a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of stacked layers is 1 to 3;
Second layer: an average layer, which averages vector sequences along the time axis; it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer, which normalizes the input of the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2, N is the number of sampled triplets, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
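For illustration only, the five layers described above might be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the hidden size, embedding size and two-layer depth are assumptions of the example, while the stacked bidirectional LSTM layers, the time-axis averaging, the fully connected layer and the L2 normalization follow the description above (the loss layer is treated separately with the training process).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerEmbeddingNet(nn.Module):
        # Sketch: stacked BiLSTM -> time-average -> DNN fully connected
        # layer -> L2 normalization to a length-1 feature vector.
        def __init__(self, feat_dim=36, hidden_dim=128, emb_dim=128, num_layers=2):
            super().__init__()
            # First layer: 1-3 stacked layers, each a forward LSTM and a
            # backward LSTM in parallel (i.e., a bidirectional LSTM).
            self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, bidirectional=True)
            # Third layer: DNN fully connected layer.
            self.fc = nn.Linear(2 * hidden_dim, emb_dim)

        def forward(self, x):          # x: (batch, frames, feat_dim)
            out, _ = self.lstm(x)      # (batch, frames, 2 * hidden_dim)
            # Second layer: averaging the concatenated forward/backward
            # outputs over time equals concatenating the forward average
            # vector and the backward average vector.
            avg = out.mean(dim=1)      # (batch, 2 * hidden_dim)
            # Fourth layer: L2 normalization.
            return F.normalize(self.fc(avg), p=2, dim=1)

A batch of per-frame features of shape (batch, frames, 36), such as the 36-dimensional MFCC features used as an example in the detailed description, then yields one unit-norm embedding per utterance.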
The present application further provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the following steps:
after the current voice data of the target user to be authenticated is received, obtaining from the database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
extracting, with a preset filter, preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
inputting the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively; computing the cosine similarity of the two resulting feature vectors; and determining an identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
In the technical solution of the present application, the current voice data received from the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; a preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing; the extracted preset-type acoustic features are then input into a pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is computed, and the verification result is determined according to its magnitude. By first framing the voice data into multiple voice frames and extracting the preset-type acoustic features from those frames, the solution can extract sufficiently many acoustic features even when the collected valid voice data is very short, and the trained deep neural network model then processes the extracted acoustic features to output the verification result. Compared with the prior art, this solution verifies speaker identity with higher accuracy and reliability.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of an embodiment of the identity verification method of the present application;
FIG. 2 is a schematic flowchart of the training process of the preset-structure deep neural network model of the present application;
FIG. 3 is a schematic diagram of the operating environment of an embodiment of the identity verification system of the present application;
FIG. 4 is a program module diagram of a first embodiment of the identity verification system of the present application;
FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
DETAILED DESCRIPTION
The principles and features of the present application are described below with reference to the accompanying drawings; the examples given are intended only to explain the present application, not to limit its scope.
FIG. 1 is a schematic flowchart of an embodiment of the identity verification method of the present application.
In this embodiment, the identity verification method includes:
Step S10: after the current voice data of the target user to be authenticated is received, obtaining from the database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data.
Standard voice data for each identity is pre-stored in the database of the identity verification system. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains from the database, according to the identity the target user claims (the identity to be verified), the corresponding standard voice data, and then frames the received current voice data and the obtained standard voice data according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising multiple voice frames obtained by framing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising multiple voice frames obtained by framing the standard voice data). The preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds.
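As a rough illustration of this framing step, the sketch below splits a waveform into overlapping 25 ms frames with a 10 ms shift; the 16 kHz sampling rate is an assumption of the example, while the frame length and shift are the example values above.

    import numpy as np

    def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
        # Split a 1-D waveform into overlapping frames (a "voice frame group").
        frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
        shift_len = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
        return np.stack([signal[i * shift_len : i * shift_len + frame_len]
                         for i in range(n_frames)])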
Step S20: extracting, with a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group.
After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the current voice frame group and in the standard voice frame group with the preset filter, to extract the preset-type acoustic features corresponding to each voice frame in the current voice frame group and to each voice frame in the standard voice frame group. For example, the preset filter is a Mel filter bank, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral features.
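One way such features might be extracted is sketched below with the librosa library; the use of librosa and the 16 kHz rate are assumptions of this example, while the 36-dimensional MFCC setting and the 25 ms/10 ms framing follow the text.

    import librosa

    def extract_mfcc(wav_path, sr=16000):
        # 36-dimensional MFCC features, one row per 25 ms frame, 10 ms shift.
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=36,
                                    n_fft=int(sr * 0.025),       # 25 ms window
                                    hop_length=int(sr * 0.010))  # 10 ms shift
        return mfcc.T  # shape: (n_frames, 36)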
Step S30: inputting the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively.
Step S40: computing the cosine similarity of the two resulting feature vectors, and determining the identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
The identity verification system contains a pre-trained preset-structure deep neural network model, a model iteratively trained on the corresponding preset-type acoustic features of sample voice data. After extracting features from the voice frames in the current voice frame group and the standard voice frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into this pre-trained model; the model converts each of them into a feature vector of a preset length (for example, a feature vector normalized to length 1), the cosine similarity of the two feature vectors is computed, and the identity verification result is determined according to its magnitude: the cosine similarity is compared with a preset threshold (for example, 0.95), and if the cosine similarity is greater than the preset threshold, the identity verification passes; otherwise, it fails. The cosine similarity is computed as cos(x_i, x_j) = x_i^T x_j, where x_i and x_j denote the two feature vectors and the superscript T denotes the transpose.
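The decision step itself reduces to a dot product of the two unit-norm vectors against the threshold; a minimal sketch, where the 0.95 threshold is the example value above:

    import numpy as np

    def verify(current_vec, standard_vec, threshold=0.95):
        # cos(x_i, x_j) = x_i^T x_j for L2-normalized embeddings.
        cos_sim = float(np.dot(current_vec, standard_vec))
        return "pass" if cos_sim > threshold else "fail"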
In the technical solution of this embodiment, the current voice data received from the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing; the extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is computed, and the verification result is determined according to its magnitude. By first framing the voice data into multiple voice frames and extracting the preset-type acoustic features from those frames, this solution can extract sufficiently many acoustic features even when the collected valid voice data is very short, and the trained deep neural network model then processes the extracted acoustic features to output the verification result. Compared with the prior art, this solution verifies speaker identity with higher accuracy and reliability.
Further, in this embodiment, before the step of framing the current voice data and the standard voice data respectively according to the preset framing parameters, the identity verification method further includes the step of:
performing voice activity detection on the current voice data and the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
Both the collected current voice data and the pre-stored standard voice data contain some non-speaker portions (for example, silence or noise). If these portions are not deleted, the voice frame groups obtained by framing the current voice data or the standard voice data will contain voice frames that include non-speaker portions (individual voice frames may even consist entirely of non-speaker audio), and the preset-type acoustic features that the preset filter extracts from such frames are impurity features, which would reduce the accuracy of the results produced by the preset-structure deep neural network model. Therefore, in this embodiment, before the voice data is framed, the non-speaker portions of the current voice data and the standard voice data are detected and deleted. The detection method used for the non-speaker portions in this embodiment is Voice Activity Detection (VAD).
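The text fixes VAD as the detection method but not a particular algorithm, so the following is only a crude energy-based stand-in to illustrate the idea; the frame size and relative threshold are assumptions of the example.

    import numpy as np

    def remove_nonspeech(signal, sample_rate=16000, frame_ms=20, rel_threshold=0.1):
        # Keep only frames whose short-time energy exceeds a fraction of the
        # mean energy; a real system would use a proper VAD algorithm here.
        n = int(sample_rate * frame_ms / 1000)
        frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
        energies = np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])
        keep = energies > rel_threshold * energies.mean()
        kept = [f for f, k in zip(frames, keep) if k]
        return np.concatenate(kept) if kept else signal[:0]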
As shown in FIG. 2, in this embodiment, the training process of the preset-structure deep neural network model is as follows:
S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
First, a preset number (for example, 10,000) of voice data samples is prepared, each sample being voice data of a known speaker identity. Among these samples, each speaker identity (or some of the speaker identities) corresponds to multiple voice data samples, and each voice data sample is labeled with a label representing the corresponding speaker identity.
S2: performing voice activity detection on each voice data sample and deleting the non-speaker speech from it, to obtain a preset number of standard voice data samples;
Voice activity detection is performed on the voice data samples to detect and delete the non-speaker audio (for example, silence or noise) in each sample, preventing the voice data samples from containing voice data unrelated to the voiceprint features of the corresponding speaker identity, which would impair the training of the model.
S3: taking a first percentage of the resulting standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
For example, 70% of the resulting standard voice data samples are taken as the training set and 30% as the validation set.
S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
Here, the preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter bank, and the preset-type acoustic features it extracts are MFCC (Mel Frequency Cepstrum Coefficient) spectral features, for example 36-dimensional MFCC spectral features.
S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding them batch by batch into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the preset-structure deep neural network model with the validation set;
The preset-type acoustic features in the training set are divided into M (for example, 30) batches, with the voice frame group as the allocation unit; each batch may be allocated the preset-type acoustic features of an equal or unequal number of voice frame groups. The preset-type acoustic features corresponding to the voice frame groups in the training set are fed into the preset-structure deep neural network model batch by batch for iterative training: each batch drives one iteration of the model, each iteration updates the model parameters, and after many iterations the model has been updated to better parameters. After the iterative training is completed, the accuracy of the model is verified with the validation set: the standard voice data in the validation set are grouped into pairs, the preset-type acoustic features of the standard voice data samples of one pair are input to the model at a time, and the output verification result is checked for correctness against the identity labels of the two standard voice data samples. After all pairs have been verified, the accuracy is computed from the number of correct results; for example, if 100 pairs are verified and 99 of them yield correct results, the accuracy is 99%.
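A sketch of this pairwise check; the data layout and the `model` callable are assumptions of the example, and 0.95 reuses the earlier example threshold.

    def validation_accuracy(model, pairs, threshold=0.95):
        # pairs: list of ((features_a, label_a), (features_b, label_b)).
        correct = 0
        for (feats_a, label_a), (feats_b, label_b) in pairs:
            emb_a, emb_b = model(feats_a), model(feats_b)
            same_predicted = float(emb_a @ emb_b) > threshold
            correct += int(same_predicted == (label_a == label_b))
        return correct / len(pairs)  # e.g. 99 correct of 100 pairs -> 99%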
S6: if the verified accuracy is greater than a preset threshold, the model training ends;
A verification threshold for the accuracy (i.e., the preset threshold, for example 98.5%) is preset in the system to check the training effect of the preset-structure deep neural network model. If the accuracy obtained by verifying the model with the validation set is greater than the preset threshold, the training of the model has reached the standard, and the model training ends.
S7: if the verified accuracy is less than or equal to the preset threshold, the number of acquired voice data samples is increased, and the above steps S1-S5 are re-executed on the enlarged sample set.
If the accuracy obtained by verifying the preset-structure deep neural network model with the validation set is less than or equal to the preset threshold, the training of the model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In this case, the number of acquired voice data samples is increased (for example, by a fixed amount or by a random amount each time), and on this basis the above steps S1-S5 are re-executed, looping until the requirement of step S6 is met, at which point the model training ends.
In this embodiment, the iterative training process of the preset-structure deep neural network model includes:
converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of a preset length according to the current parameters of the model;
randomly selecting from the resulting feature vectors to obtain multiple triplets, the i-th triplet (x_i1, x_i2, x_i3) consisting of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
computing, according to the predetermined calculation formula, the cosine similarity cos(x_i1, x_i2) = x_i1^T x_i2 between x_i1 and x_i2, and the cosine similarity cos(x_i1, x_i3) = x_i1^T x_i3 between x_i1 and x_i3;
updating the parameters of the model according to the cosine similarities cos(x_i1, x_i2) and cos(x_i1, x_i3) and a predetermined loss function L, the formula of the predetermined loss function L being:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2 and N is the number of triplets obtained.
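Assuming the embeddings are already L2-normalized (which the fourth, normalization layer guarantees), this loss might be sketched in PyTorch as follows; the default margin value plays the role of α.

    import torch

    def triplet_cosine_loss(anchor, positive, negative, alpha=0.1):
        # L = sum_i max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + alpha), where
        # (anchor, positive) share a speaker and (anchor, negative) do not.
        cos_pos = (anchor * positive).sum(dim=1)  # cos(x_i1, x_i2)
        cos_neg = (anchor * negative).sum(dim=1)  # cos(x_i1, x_i3)
        return torch.clamp(cos_neg - cos_pos + alpha, min=0).sum()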
The model parameters are updated in two steps: 1. the gradients of the preset-structure deep neural network are computed with the backpropagation algorithm; 2. the parameters of the preset-structure deep neural network are updated with the mini-batch SGD (mini-batch stochastic gradient descent) method.
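Putting the pieces together, one mini-batch update might be sketched as follows; the learning rate is an assumption of the example, and `SpeakerEmbeddingNet` and `triplet_cosine_loss` refer to the illustrative sketches above.

    import torch

    model = SpeakerEmbeddingNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # mini-batch SGD

    def train_step(anchor_feats, positive_feats, negative_feats):
        # One iteration: forward pass, triplet loss, backpropagation, SGD update.
        optimizer.zero_grad()
        loss = triplet_cosine_loss(model(anchor_feats),
                                   model(positive_feats),
                                   model(negative_feats))
        loss.backward()   # backpropagation computes the gradients
        optimizer.step()  # mini-batch SGD updates the parameters
        return loss.item()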
Further, the network structure of the preset-structure deep neural network model of this embodiment is as follows:
First layer: several stacked neural network layers of identical structure, where each layer consists of a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of stacked layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
Second layer: an average layer, which averages vector sequences along the time axis; it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer, which normalizes the input of the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2, N is the number of sampled triplets, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
In addition, the present application further provides an identity verification system.
FIG. 3 is a schematic diagram of the operating environment of a preferred embodiment of the identity verification system 10 of the present application.
In this embodiment, the identity verification system 10 is installed and runs in the electronic device 1. The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a server. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a display 13. FIG. 3 shows only the electronic device 1 with the components 11-13, but it should be understood that not all of the illustrated components need be implemented, and more or fewer components may be implemented instead.
In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a hard disk or internal memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 is used to store the application software installed in the electronic device 1 and various types of data, such as the program code of the identity verification system 10, and may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a microprocessor or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the identity verification system 10.
In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 13 is used to display the information processed in the electronic device 1 and to display a visualized user interface, for example a service customization interface. The components 11-13 of the electronic device 1 communicate with one another through a system bus.
FIG. 4 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application. In this embodiment, the identity verification system 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application. For example, in FIG. 4, the identity verification system 10 may be divided into a framing module 101, an extraction module 102, a calculation module 103 and a result determination module 104. A module referred to in this application is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing the execution of the identity verification system 10 in the electronic device 1, wherein:
the framing module 101 is configured to, after the current voice data of the target user to be authenticated is received, obtain from the database the standard voice data corresponding to the identity to be verified, and frame the current voice data and the standard voice data respectively according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data and the standard voice frame group corresponding to the standard voice data.
Standard voice data for each identity is pre-stored in the database of the identity verification system. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains from the database, according to the identity the target user claims (the identity to be verified), the corresponding standard voice data, and then frames the received current voice data and the obtained standard voice data according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising multiple voice frames obtained by framing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising multiple voice frames obtained by framing the standard voice data). The preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds.
The extraction module 102 is configured to extract, with the preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group.
After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the current voice frame group and in the standard voice frame group with the preset filter, to extract the preset-type acoustic features corresponding to each voice frame in the current voice frame group and to each voice frame in the standard voice frame group. For example, the preset filter is a Mel filter bank, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral features.
The calculation module 103 is configured to input the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively.
The result determination module 104 is configured to compute the cosine similarity of the two resulting feature vectors and determine the identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
The identity verification system contains a pre-trained preset-structure deep neural network model, a model iteratively trained on the corresponding preset-type acoustic features of sample voice data. After extracting features from the voice frames in the current voice frame group and the standard voice frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into this pre-trained model; the model converts each of them into a feature vector of a preset length (for example, a feature vector normalized to length 1), the cosine similarity of the two feature vectors is computed, and the identity verification result is determined according to its magnitude: the cosine similarity is compared with a preset threshold (for example, 0.95), and if the cosine similarity is greater than the preset threshold, the identity verification passes; otherwise, it fails. The cosine similarity is computed as cos(x_i, x_j) = x_i^T x_j, where x_i and x_j denote the two feature vectors and the superscript T denotes the transpose.
In the technical solution of this embodiment, the current voice data received from the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing; the extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is computed, and the verification result is determined according to its magnitude. By first framing the voice data into multiple voice frames and extracting the preset-type acoustic features from those frames, this solution can extract sufficiently many acoustic features even when the collected valid voice data is very short, and the trained deep neural network model then processes the extracted acoustic features to output the verification result. Compared with the prior art, this solution verifies speaker identity with higher accuracy and reliability.
FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
In this embodiment, the identity verification system further includes:
the detection module 105, configured to, before the current voice data and the standard voice data are framed respectively according to the preset framing parameters, perform voice activity detection on the current voice data and the standard voice data respectively, and delete the non-speaker speech from the current voice data and the standard voice data.
Both the collected current voice data and the pre-stored standard voice data contain some non-speaker portions (for example, silence or noise). If these portions are not deleted, the voice frame groups obtained by framing the current voice data or the standard voice data will contain voice frames that include non-speaker portions (individual voice frames may even consist entirely of non-speaker audio), and the preset-type acoustic features that the preset filter extracts from such frames are impurity features, which would reduce the accuracy of the results produced by the preset-structure deep neural network model. Therefore, in this embodiment, before the voice data is framed, the non-speaker portions of the current voice data and the standard voice data are detected and deleted. The detection method used for the non-speaker portions in this embodiment is Voice Activity Detection (VAD).
In this embodiment, the training process of the preset-structure deep neural network model is as follows (see FIG. 2):
S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
First, a preset number (for example, 10,000) of voice data samples is prepared, each sample being voice data of a known speaker identity. Among these samples, each speaker identity (or some of the speaker identities) corresponds to multiple voice data samples, and each voice data sample is labeled with a label representing the corresponding speaker identity.
S2: performing voice activity detection on each voice data sample and deleting the non-speaker speech from it, to obtain a preset number of standard voice data samples;
Voice activity detection is performed on the voice data samples to detect and delete the non-speaker audio (for example, silence or noise) in each sample, preventing the voice data samples from containing voice data unrelated to the voiceprint features of the corresponding speaker identity, which would impair the training of the model.
S3: taking a first percentage of the resulting standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
For example, 70% of the resulting standard voice data samples are taken as the training set and 30% as the validation set.
S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
Here, the preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter bank, and the preset-type acoustic features it extracts are MFCC (Mel Frequency Cepstrum Coefficient) spectral features, for example 36-dimensional MFCC spectral features.
S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding them batch by batch into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the preset-structure deep neural network model with the validation set;
The preset-type acoustic features in the training set are divided into M (for example, 30) batches, with the voice frame group as the allocation unit; each batch may be allocated the preset-type acoustic features of an equal or unequal number of voice frame groups. The preset-type acoustic features corresponding to the voice frame groups in the training set are fed into the preset-structure deep neural network model batch by batch for iterative training: each batch drives one iteration of the model, each iteration updates the model parameters, and after many iterations the model has been updated to better parameters. After the iterative training is completed, the accuracy of the model is verified with the validation set: the standard voice data in the validation set are grouped into pairs, the preset-type acoustic features of the standard voice data samples of one pair are input to the model at a time, and the output verification result is checked for correctness against the identity labels of the two standard voice data samples. After all pairs have been verified, the accuracy is computed from the number of correct results; for example, if 100 pairs are verified and 99 of them yield correct results, the accuracy is 99%.
S6: if the verified accuracy is greater than a preset threshold, the model training ends;
A verification threshold for the accuracy (i.e., the preset threshold, for example 98.5%) is preset in the system to check the training effect of the preset-structure deep neural network model. If the accuracy obtained by verifying the model with the validation set is greater than the preset threshold, the training of the model has reached the standard, and the model training ends.
S7: if the verified accuracy is less than or equal to the preset threshold, the number of acquired voice data samples is increased, and the above steps S1-S5 are re-executed on the enlarged sample set.
If the accuracy obtained by verifying the preset-structure deep neural network model with the validation set is less than or equal to the preset threshold, the training of the model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In this case, the number of acquired voice data samples is increased (for example, by a fixed amount or by a random amount each time), and on this basis the above steps S1-S5 are re-executed, looping until the requirement of step S6 is met, at which point the model training ends.
In this embodiment, the iterative training process of the preset-structure deep neural network model includes:
converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of a preset length according to the current parameters of the model;
randomly selecting from the resulting feature vectors to obtain multiple triplets, the i-th triplet (x_i1, x_i2, x_i3) consisting of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
computing, according to the predetermined calculation formula, the cosine similarity cos(x_i1, x_i2) = x_i1^T x_i2 between x_i1 and x_i2, and the cosine similarity cos(x_i1, x_i3) = x_i1^T x_i3 between x_i1 and x_i3;
updating the parameters of the model according to the cosine similarities cos(x_i1, x_i2) and cos(x_i1, x_i3) and a predetermined loss function L, the formula of the predetermined loss function L being:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2 and N is the number of triplets obtained.
The model parameters are updated in two steps: 1. the gradients of the preset-structure deep neural network are computed with the backpropagation algorithm; 2. the parameters of the preset-structure deep neural network are updated with the mini-batch SGD (mini-batch stochastic gradient descent) method.
Further, the network structure of the preset-structure deep neural network model of this embodiment is as follows:
First layer: several stacked neural network layers of identical structure, where each layer consists of a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of stacked layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
Second layer: an average layer, which averages vector sequences along the time axis; it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer, which normalizes the input of the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2, N is the number of sampled triplets, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
Further, the present application also provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the identity verification method of any of the above embodiments.
The above description is only of preferred embodiments of the present application and does not thereby limit the patent scope of the present application; any equivalent structural transformation made using the contents of the specification and drawings of the present application under its inventive concept, and any direct or indirect application in other related technical fields, are likewise included within the patent protection scope of the present application.

Claims (20)

1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing an identity verification system operable on the processor, the identity verification system, when executed by the processor, implementing the following steps:
    after the current voice data of the target user to be authenticated is received, obtaining from a database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
    extracting, with a preset filter, preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
    inputting the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively;
    computing the cosine similarity of the two resulting feature vectors, and determining an identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
  2. The electronic device of claim 1, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
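A minimal PyTorch-style sketch of this five-layer structure up to the normalization layer (the loss layer is applied only during training, as sketched earlier); the feature dimension, hidden size, embedding length, and layer count here are illustrative assumptions within the claimed structure:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerEmbeddingNet(nn.Module):
        """Stacked bidirectional LSTM -> temporal average layer ->
        DNN fully connected layer -> L2 normalization."""
        def __init__(self, feat_dim=40, hidden=128, emb_dim=256, num_layers=2):
            super().__init__()
            # First layer: 1-3 stacked layers, each a forward + backward LSTM.
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                                batch_first=True, bidirectional=True)
            # Third layer: DNN fully connected layer.
            self.fc = nn.Linear(2 * hidden, emb_dim)

        def forward(self, frames):            # frames: (batch, time, feat_dim)
            out, _ = self.lstm(frames)        # (batch, time, 2*hidden)
            half = out.size(-1) // 2
            fwd, bwd = out[..., :half], out[..., half:]
            # Second layer: average each direction along the time axis,
            # then concatenate the forward and backward average vectors.
            avg = torch.cat([fwd.mean(dim=1), bwd.mean(dim=1)], dim=-1)
            emb = self.fc(avg)
            # Fourth layer: L2-normalize to a feature vector of length 1.
            return F.normalize(emb, p=2, dim=-1)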
  3. The electronic device of claim 1, wherein before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the processor further executes the identity verification system to implement the following step:
    performing active endpoint detection on the current voice data and on the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
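A minimal sketch of such active endpoint detection with the webrtcvad package, keeping only the frames classified as speech; the 30 ms frame length and aggressiveness mode 2 are illustrative assumptions:

    import webrtcvad

    def drop_non_speech(pcm16: bytes, sample_rate=16000, frame_ms=30, mode=2):
        """Delete non-speaker audio: keep only frames the detector marks
        as speech. Expects 16-bit mono PCM at 8/16/32/48 kHz."""
        vad = webrtcvad.Vad(mode)            # 0 (lenient) .. 3 (aggressive)
        frame_bytes = sample_rate * frame_ms // 1000 * 2
        kept = []
        for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
            frame = pcm16[i:i + frame_bytes]
            if vad.is_speech(frame, sample_rate):
                kept.append(frame)
        return b"".join(kept)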
  4. The electronic device of claim 3, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  5. The electronic device of claim 1, wherein the training process of the preset-structure deep neural network model is:
    S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
    S2: performing active endpoint detection on each voice data sample and deleting the non-speaker speech from the voice data samples, to obtain a preset number of standard voice data samples;
    S3: taking a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
    S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
    S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding the batches into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the model with the validation set;
    S6: if the verified accuracy is greater than a preset threshold, ending the model training;
    S7: if the verified accuracy is less than or equal to the preset threshold, increasing the number of acquired voice data samples and re-executing steps S1-S5 based on the increased voice data samples.
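A control-flow sketch of steps S1-S7 in Python; here train_one_pass, evaluate, and get_more_samples are assumed helpers standing in for the batch training, validation-accuracy check, and sample acquisition, and the 0.95 threshold, 80/20 split, and 32 batches are illustrative values only:

    import random

    def chunk(items, m):
        """Split items into m roughly equal batches (step S5)."""
        size = max(1, -(-len(items) // m))        # ceiling division
        return [items[i:i + size] for i in range(0, len(items), size)]

    def train_until_accurate(samples, labels, model, acc_threshold=0.95,
                             train_frac=0.8, m_batches=32):
        """Split labeled standard samples into training and validation sets,
        train in M batches, validate, and grow the sample pool until the
        validation accuracy exceeds the preset threshold (steps S3-S7)."""
        while True:
            data = list(zip(samples, labels))
            random.shuffle(data)
            n_train = int(train_frac * len(data))
            train_set, val_set = data[:n_train], data[n_train:]   # S3 split
            for batch in chunk(train_set, m_batches):             # S5 batches
                train_one_pass(model, batch)                      # assumed helper
            if evaluate(model, val_set) > acc_threshold:          # S6: done
                return model
            samples, labels = get_more_samples(samples, labels)   # S7: add data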
  6. The electronic device of claim 5, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  7. The electronic device of claim 5, wherein the process of iteratively training the preset-structure deep neural network model comprises:
    converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of the preset length according to the current parameters of the model;
    randomly selecting from the resulting feature vectors to obtain a plurality of triplets, the i-th triplet (x_{i1}, x_{i2}, x_{i3}) consisting of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
    calculating, with a predetermined calculation formula, the cosine similarity cos(x_{i1}, x_{i2}) between x_{i1} and x_{i2}, and the cosine similarity cos(x_{i1}, x_{i3}) between x_{i1} and x_{i3};
    updating the parameters of the model according to the cosine similarities cos(x_{i1}, x_{i2}) and cos(x_{i1}, x_{i3}) and a predetermined loss function L, the formula of the predetermined loss function L being
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, and N is the number of triplets obtained.
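A minimal Python sketch of the random triplet selection described above, grouping feature vectors by speaker label and drawing (x_{i1}, x_{i2}, x_{i3}) so that the first two share a speaker and the third does not; all names are illustrative, and at least two speakers with two or more vectors each are assumed:

    import random
    from collections import defaultdict

    def sample_triplets(vectors, speaker_ids, n_triplets):
        """Randomly draw (x_i1, x_i2, x_i3) triplets: x_i1 and x_i2 come
        from the same speaker, x_i3 from a different speaker."""
        by_speaker = defaultdict(list)
        for vec, spk in zip(vectors, speaker_ids):
            by_speaker[spk].append(vec)
        eligible = [s for s, vs in by_speaker.items() if len(vs) >= 2]
        triplets = []
        for _ in range(n_triplets):
            spk = random.choice(eligible)
            x1, x2 = random.sample(by_speaker[spk], 2)       # same speaker
            other = random.choice([s for s in by_speaker if s != spk])
            x3 = random.choice(by_speaker[other])            # different speaker
            triplets.append((x1, x2, x3))
        return triplets

The sampled triplets feed directly into the triplet cosine loss sketched earlier, whose gradient is then used to update the model parameters.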
  8. The electronic device of claim 7, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  9. An identity verification method, comprising:
    after receiving the current voice data of a target user to be authenticated, acquiring from a database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
    extracting, with a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
    inputting the extracted preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into a pre-trained deep neural network model of a preset structure, to obtain feature vectors of a preset length corresponding respectively to the current voice data and the standard voice data;
    calculating the cosine similarity of the two obtained feature vectors, and determining an identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification pass result and a verification failure result.
  10. The identity verification method of claim 9, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  11. The identity verification method of claim 9, wherein before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further comprises the step of:
    performing active endpoint detection on the current voice data and on the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
  12. The identity verification method of claim 11, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  13. The identity verification method of claim 9, wherein the training process of the preset-structure deep neural network model is:
    S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
    S2: performing active endpoint detection on each voice data sample and deleting the non-speaker speech from the voice data samples, to obtain a preset number of standard voice data samples;
    S3: taking a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
    S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
    S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding the batches into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the model with the validation set;
    S6: if the verified accuracy is greater than a preset threshold, ending the model training;
    S7: if the verified accuracy is less than or equal to the preset threshold, increasing the number of acquired voice data samples and re-executing steps S1-S5 based on the increased voice data samples.
  14. The identity verification method of claim 13, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  15. The identity verification method of claim 13, wherein the process of iteratively training the preset-structure deep neural network model comprises:
    converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of the preset length according to the current parameters of the model;
    randomly selecting from the resulting feature vectors to obtain a plurality of triplets, the i-th triplet (x_{i1}, x_{i2}, x_{i3}) consisting of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
    calculating, with a predetermined calculation formula, the cosine similarity cos(x_{i1}, x_{i2}) between x_{i1} and x_{i2}, and the cosine similarity cos(x_{i1}, x_{i3}) between x_{i1} and x_{i3};
    updating the parameters of the model according to the cosine similarities cos(x_{i1}, x_{i2}) and cos(x_{i1}, x_{i3}) and a predetermined loss function L, the formula of the predetermined loss function L being
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, and N is the number of triplets obtained.
  16. The identity verification method of claim 15, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  17. A computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the following steps:
    after receiving the current voice data of a target user to be authenticated, acquiring from a database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
    extracting, with a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
    inputting the extracted preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into a pre-trained deep neural network model of a preset structure, to obtain feature vectors of a preset length corresponding respectively to the current voice data and the standard voice data;
    calculating the cosine similarity of the two obtained feature vectors, and determining an identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification pass result and a verification failure result.
  18. The computer-readable storage medium of claim 17, wherein before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further comprises the step of:
    performing active endpoint detection on the current voice data and on the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
  19. The computer-readable storage medium of claim 17, wherein the training process of the preset-structure deep neural network model is:
    S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
    S2: performing active endpoint detection on each voice data sample and deleting the non-speaker speech from the voice data samples, to obtain a preset number of standard voice data samples;
    S3: taking a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
    S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
    S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding the batches into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the model with the validation set;
    S6: if the verified accuracy is greater than a preset threshold, ending the model training;
    S7: if the verified accuracy is less than or equal to the preset threshold, increasing the number of acquired voice data samples and re-executing steps S1-S5 based on the increased voice data samples.
  20. The computer-readable storage medium of claim 17, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
PCT/CN2018/102105 2018-03-19 2018-08-24 Electronic device, identity verification method and computer-readable storage medium WO2019179029A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810225887.2A CN108564955B (en) 2018-03-19 2018-03-19 Electronic device, auth method and computer readable storage medium
CN201810225887.2 2018-03-19

Publications (1)

Publication Number Publication Date
WO2019179029A1 true WO2019179029A1 (en) 2019-09-26

Family

ID=63532742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102105 WO2019179029A1 (en) 2018-03-19 2018-08-24 Electronic device, identity verification method and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN108564955B (en)
WO (1) WO2019179029A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712792A (en) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Dialect recognition model training method, readable storage medium and terminal device

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564954B (en) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity verification method, and storage medium
CN110289003B (en) * 2018-10-10 2021-10-29 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN109346086A (en) * 2018-10-26 2019-02-15 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and computer readable storage medium
US10887317B2 (en) * 2018-11-28 2021-01-05 Sap Se Progressive authentication security adapter
CN110148402A (en) * 2019-05-07 2019-08-20 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN110570871A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, device and equipment
CN111933153B (en) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 Voice segmentation point determining method and device
CN112016673A (en) * 2020-07-24 2020-12-01 浙江工业大学 Mobile equipment user authentication method and device based on optimized LSTM
CN112309365A (en) * 2020-10-21 2021-02-02 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112347788A (en) * 2020-11-06 2021-02-09 平安消费金融有限公司 Corpus processing method, apparatus and storage medium
CN113178197B (en) * 2021-04-27 2024-01-09 平安科技(深圳)有限公司 Training method and device of voice verification model and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores
CN105139857A (en) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countercheck method for automatically identifying speaker aiming to voice deception
CN106205624A (en) * 2016-07-15 2016-12-07 河海大学 A kind of method for recognizing sound-groove based on DBSCAN algorithm
CN106782564A (en) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech data
CN107610707A (en) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device

Also Published As

Publication number Publication date
CN108564955B (en) 2019-09-03
CN108564955A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
WO2019179029A1 (en) Electronic device, identity verification method and computer-readable storage medium
WO2019179036A1 (en) Deep neural network model, electronic device, identity authentication method, and storage medium
JP6429945B2 (en) Method and apparatus for processing audio data
US6490560B1 (en) Method and system for non-intrusive speaker verification using behavior models
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
US7689418B2 (en) Method and system for non-intrusive speaker verification using behavior models
US10650379B2 (en) Method and system for validating personalized account identifiers using biometric authentication and self-learning algorithms
US11482050B2 (en) Intelligent gallery management for biometrics
CN108989349B (en) User account unlocking method and device, computer equipment and storage medium
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
US11062120B2 (en) High speed reference point independent database filtering for fingerprint identification
KR20180082948A (en) Method and apparatus for authenticating a user using an electrocardiogram signal
US20230012235A1 (en) Using an enrolled biometric dataset to detect adversarial examples in biometrics-based authentication system
EP1470549B1 (en) Method and system for non-intrusive speaker verification using behavior models
WO2023134232A1 (en) Method, apparatus and device for updating feature vector database, and medium
CN116561737A (en) Password validity detection method based on user behavior base line and related equipment thereof
CN113035230A (en) Authentication model training method and device and electronic equipment
WO2023078115A1 (en) Information verification method, and server and storage medium
CN117373082A (en) Face recognition method, device, equipment and storage medium
CN111261155A (en) Speech processing method, computer-readable storage medium, computer program, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18910797

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18910797

Country of ref document: EP

Kind code of ref document: A1