WO2019179029A1 - Electronic device, identity verification method, and computer-readable storage medium
- Publication number: WO2019179029A1
- Application: PCT/CN2018/102105
- Authority: WIPO (PCT)
Classifications
- G10L17/00: Speaker identification or verification techniques
- G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Training, enrolment or model building
- G10L17/18: Artificial neural networks; connectionist approaches
- G06N3/044: Recurrent networks, e.g. Hopfield networks
- G06N3/045: Combinations of networks
- G06N3/084: Backpropagation, e.g. using gradient descent
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Description
- the present application relates to the field of voiceprint recognition technologies, and in particular, to an electronic device, an authentication method, and a computer readable storage medium.
- Speaker recognition, commonly referred to as voiceprint recognition, is a type of biometric technology that is often used to confirm whether a certain segment of speech was spoken by a designated person; it is a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example in finance, securities, social security, public security, the military, and other civil safety certifications.
- Speaker recognition includes text-related recognition and text-independent recognition.
- text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past.
- However, the existing text-independent speaker recognition technology is still insufficiently accurate and error-prone.
- the main purpose of the present application is to provide an electronic device, an authentication method, and a computer readable storage medium, which are intended to improve the accuracy of speaker authentication.
- an electronic device proposed by the present application includes a memory and a processor, the memory storing an identity verification system executable on the processor; when the identity verification system is executed by the processor, the following steps are implemented:
- the standard voice data corresponding to the identity to be verified is obtained from the database, and the current voice data and the standard voice data are each subjected to framing processing according to preset framing parameters to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
- the processor is further configured to execute the identity verification system to implement the following steps:
- Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
- the training process of the preset structure deep neural network model is:
- S1 acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing a corresponding speaker identity;
- a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
- the process of the iterative training of the preset structure deep neural network model comprises:
- the i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
- ⁇ is a constant ranging from 0.05 to 0.2
- N is the number of triples obtained.
- the network structure of the preset structure deep neural network model is as follows:
- the first layer consists of neural network layers with the same structure, where each layer adopts a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1-3;
- the second layer is the average layer; its function is to average vector sequences along the time axis, averaging the vector sequences output by the preceding forward LSTM and backward LSTM to obtain a forward average vector and a backward average vector, which are then concatenated to form a single vector;
- the third layer is the deep neural network DNN fully connected layer
- the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
- the fifth layer is the loss layer; the loss function L is L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant with a value ranging from 0.05 to 0.2, cos(x_i1, x_i2) represents the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) represents the cosine similarity of two feature vectors that do not belong to the same speaker.
- the application also provides an authentication method, which includes:
- the standard voice data corresponding to the identity to be verified is obtained from the database, and the current voice data and the standard voice data are each subjected to framing processing according to preset framing parameters to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
- before the step of performing the framing processing on the current voice data and the standard voice data according to preset framing parameters, the identity verification method further includes the step of:
- Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
- the training process of the preset structure deep neural network model is:
- S1 acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing a corresponding speaker identity;
- a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
- the network structure of the preset structure deep neural network model is as follows:
- the first layer consists of neural network layers with the same structure, where each layer adopts a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1-3;
- the second layer is the average layer; its function is to average vector sequences along the time axis, averaging the vector sequences output by the preceding forward LSTM and backward LSTM to obtain a forward average vector and a backward average vector, which are then concatenated to form a single vector;
- the third layer is the deep neural network DNN fully connected layer
- the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
- the fifth layer is the loss layer; the loss function L is L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant with a value ranging from 0.05 to 0.2, cos(x_i1, x_i2) represents the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) represents the cosine similarity of two feature vectors that do not belong to the same speaker.
- the application further provides a computer readable storage medium storing an identity verification system executable by at least one processor to cause the at least one processor to perform the following steps:
- the standard voice data corresponding to the identity to be verified is obtained from the database, and the current voice data and the standard voice data are each subjected to framing processing according to preset framing parameters to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
- the technical solution of the present application first performs framing processing on the received current voice data of the target user whose identity is to be verified and on the standard voice data of the identity to be verified, and extracts the preset type acoustic features of each voice frame obtained by the framing processing using a preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is then calculated, and the verification result is confirmed according to the magnitude of the cosine similarity.
- because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained from the collected voice data even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result, making speaker identity verification more accurate and reliable.
- FIG. 1 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
- FIG. 2 is a schematic flow chart of a training process of a preset structure deep neural network model according to the present application
- FIG. 3 is a schematic diagram of an operating environment of an embodiment of an identity verification system according to the present application.
- FIG. 4 is a block diagram of a program of an embodiment of an identity verification system of the present application.
- FIG. 5 is a block diagram of a program of an embodiment of an identity verification system according to the present application.
- FIG. 1 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
- the identity verification method includes:
- Step S10: After receiving the current voice data of the target user to be authenticated, the standard voice data corresponding to the identity to be verified is obtained from the database, and the current voice data and the standard voice data are each subjected to framing processing according to preset framing parameters to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
- the identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains, according to the identity claimed by the target user (the identity to be verified), the standard voice data corresponding to that identity from the database, and then performs framing processing on the received current voice data and the obtained standard voice data according to preset framing parameters, respectively, to obtain the current voice frame group corresponding to the current voice data (comprising a plurality of voice frames obtained by dividing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising a plurality of voice frames obtained by dividing the standard voice data).
- the preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds.
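The framing step above can be sketched in Python as follows. This is only an illustrative sketch; the function name and the 16 kHz sample rate are assumptions, not specified by the patent.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a sequence of audio samples into overlapping frames
    (25 ms frame length, 10 ms frame shift, per the example parameters)."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift_len = sample_rate * shift_ms // 1000   # 160 samples at 16 kHz
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift_len):
        frames.append(samples[start:start + frame_len])
    return frames

# one second of audio at 16 kHz yields 98 overlapping 25 ms frames
frames = frame_signal(list(range(16000)))
```

The overlap (frame shift smaller than frame length) is what lets a short utterance still yield many frames, which is the point made in the description above.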
- Step S20 extracting, by using a preset filter, a preset type acoustic feature of each voice frame in the current voice frame group and a preset type acoustic feature of each voice frame in the standard voice frame group;
- After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the current voice frame group and the standard voice frame group by using a preset filter, to extract the preset type acoustic features of each voice frame.
- the preset filter is a Mel filter
- the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral feature.
- Step S30: The preset type acoustic features corresponding to the extracted current voice frame group and the preset type acoustic features corresponding to the standard voice frame group are respectively input into the pre-trained preset structure deep neural network model to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data;
- Step S40 Calculate the cosine similarity of the two feature vectors, and determine an identity verification result according to the calculated cosine similarity size, where the identity verification result includes a verification pass result and a verification failure result.
- the identity verification system has a pre-trained preset structure deep neural network model, obtained by iterative training using the preset type acoustic features corresponding to sample voice data. After extracting the preset type acoustic features of the current voice frame group and the standard voice frame group, the identity verification system inputs the preset type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model; the model converts them into feature vectors of a preset length (for example, feature vectors of length 1). The cosine similarity of the two feature vectors is then calculated, and the identity verification result is determined according to the calculated magnitude of the cosine similarity, that is, the cosine similarity is compared with a preset threshold (for example, 0.95): if the cosine similarity is not less than the preset threshold, the result is a verification pass; otherwise, it is a verification failure.
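The final comparison can be sketched in Python as follows. This is a minimal illustration; the function names are assumptions, and 0.95 is the example threshold given above. Note that when the model's normalization layer has already scaled the feature vectors to length 1, the cosine similarity reduces to a plain dot product.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify_identity(current_vec, standard_vec, threshold=0.95):
    """Return True (verification pass) when the cosine similarity
    reaches the preset threshold, False (verification failure) otherwise."""
    return cosine_similarity(current_vec, standard_vec) >= threshold
```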
- the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed, and the preset type acoustic features of each voice frame obtained by the framing processing are extracted using a preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is then calculated, and the verification result is confirmed according to the magnitude of the cosine similarity.
- because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained from the collected voice data even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result, making speaker identity verification more accurate and reliable.
- the method further includes the following steps:
- Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
- non-speaker voice parts are, for example, silence or noise;
- VAD (voice activity detection)
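One common way to realize such voice activity detection is a simple frame-energy gate. The sketch below is illustrative only; the patent does not fix a particular VAD algorithm, and the threshold value and function name are assumptions.

```python
def energy_vad(frames, energy_threshold=0.01):
    """Keep only frames whose mean energy exceeds a threshold,
    dropping low-energy (silence) frames -- a crude VAD."""
    voiced = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy > energy_threshold:
            voiced.append(frame)
    return voiced
```

Real systems typically combine energy with spectral cues, but the effect is the same as described: non-speaker portions are deleted before framing and feature extraction.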
- the training process of the preset structure deep neural network model is:
- S1 acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing a corresponding speaker identity;
- each voice data sample is voice data of a known speaker identity; among the voice data samples, each speaker identity (or some of the speaker identities) corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of the corresponding speaker.
- non-speaker voices (e.g., silence or noise)
- a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
- 70% of the resulting standard speech data samples are used as training sets and 30% are used as validation sets.
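The split can be sketched as follows (illustrative only; the percentages are parameters, with the 70/30 split above as the example, and the sequential split rather than a shuffled one is an assumption):

```python
def split_samples(samples, train_pct=70, valid_pct=30):
    """Split labeled voice data samples into a training set and a
    verification set; the two percentages must not exceed 100% combined."""
    assert train_pct + valid_pct <= 100
    n_train = len(samples) * train_pct // 100
    n_valid = len(samples) * valid_pct // 100
    return samples[:n_train], samples[n_train:n_train + n_valid]

train_set, valid_set = split_samples(list(range(100)))
```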
- the preset framing parameter is, for example, framed every 25 milliseconds, and the frame is shifted by 10 milliseconds;
- the preset filter is, for example, a Mel filter, and the preset type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstrum Coefficient) spectral features, for example, 36-dimensional MFCC spectral features.
- the verification set is used to verify the accuracy of the preset structure deep neural network model
- the preset types of acoustic features in the training set are processed in batches and divided into M (for example, 30) batches.
- the batches can be divided by voice frame group, with an equal or unequal number of voice frame groups (and their corresponding preset type acoustic features) allocated to each batch; the preset type acoustic features corresponding to each voice frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training. Each batch of preset type acoustic features makes the preset structure deep neural network model iterate once, and each iteration updates the model parameters; after multiple iterations of training, the preset structure deep neural network model has been updated to better model parameters.
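The batch assignment described above (equal or unequal numbers of voice frame groups per batch) can be sketched as a simple round-robin split; the round-robin strategy and names are assumptions chosen for illustration, since the patent allows any division into M batches:

```python
def make_batches(frame_groups, m=30):
    """Distribute voice frame groups across m batches, round-robin,
    so batch sizes differ by at most one group."""
    batches = [[] for _ in range(m)]
    for idx, group in enumerate(frame_groups):
        batches[idx % m].append(group)
    return batches
```

Each resulting batch would then drive one parameter-update iteration of the model.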
- the accuracy of the preset structure deep neural network model is then verified using the verification set: the standard voice data samples in the verification set are paired into groups of two, and each group of standard voice data samples is input into the model in turn.
- the accuracy rate is calculated from the number of correct verification results; for example, if 100 groups are verified and 99 groups yield correct verification results, the accuracy rate is 99%.
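The accuracy computation and the subsequent threshold check can be sketched as follows (a minimal illustration using the 98.5% example threshold mentioned below; function names are assumptions):

```python
def accuracy_rate(results):
    """Percentage of verification groups whose result was correct;
    `results` is a list of booleans, True meaning a correct verification."""
    return 100.0 * sum(results) / len(results)

def training_meets_standard(results, threshold=98.5):
    """Training reaches the standard when accuracy exceeds the preset threshold."""
    return accuracy_rate(results) > threshold
```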
- a verification threshold for the accuracy rate (i.e., a preset threshold, for example, 98.5%) is preset in the system for verifying the training effect of the preset structure deep neural network model;
- if the accuracy of the preset structure deep neural network model on the verification set is greater than the preset threshold, the training of the preset structure deep neural network model has reached the standard, and the model training ends.
- otherwise, steps S1-S5 are re-executed, and the loop continues until the requirement of step S6 is met, at which point the model training ends.
- the process of the iterative training of the preset structure deep neural network model includes:
- the i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
- ⁇ is a constant ranging from 0.05 to 0.2
- N is the number of triples obtained.
- the model parameter updating step is: 1. using the backpropagation algorithm to calculate the gradients of the preset structure deep neural network; 2. using mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
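The training objective over a batch of triplets can be sketched in Python. This is an illustrative reconstruction of the triplet cosine loss described in this section (x_i1 and x_i2 from the same speaker, x_i1 and x_i3 from different speakers); α = 0.1 is one value inside the stated 0.05-0.2 range, and the function names are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def triplet_loss(triplets, alpha=0.1):
    """Sum over triplets of max(0, cos(x1, x3) - cos(x1, x2) + alpha):
    zero when the same-speaker pair is more similar than the
    different-speaker pair by at least the margin alpha."""
    total = 0.0
    for x1, x2, x3 in triplets:  # x1, x2 same speaker; x1, x3 different
        total += max(0.0,
                     cosine_similarity(x1, x3)
                     - cosine_similarity(x1, x2)
                     + alpha)
    return total
```

Minimizing this loss with backpropagation and mini-batch SGD, as described above, pushes same-speaker embeddings together and different-speaker embeddings apart by at least the margin α.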
- the network structure of the preset structure deep neural network model of this embodiment is as follows:
- the first layer consists of neural network layers with the same structure, where each layer adopts a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1-3; the forward LSTM and backward LSTM each output a vector sequence;
- the second layer is the average layer; its function is to average vector sequences along the time axis, averaging the vector sequences output by the preceding forward LSTM and backward LSTM to obtain a forward average vector and a backward average vector, which are then concatenated to form a single vector;
- the third layer is the deep neural network DNN fully connected layer
- the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
- the fifth layer is the loss layer; the loss function L is L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant with a value ranging from 0.05 to 0.2, cos(x_i1, x_i2) represents the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) represents the cosine similarity of two feature vectors that do not belong to the same speaker.
- the present application also proposes an identity verification system.
- FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of the identity verification system 10 of the present application.
- the identity verification system 10 is installed and operates in the electronic device 1.
- the electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a server.
- the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
- Figure 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead.
- the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk or memory of the electronic device 1.
- the memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in hard disk equipped on the electronic device 1, a smart memory card (SMC), a Secure Digital (SD) card, a flash card, etc.
- the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
- the memory 11 is used to store application software and various types of data installed in the electronic device 1, such as program codes of the authentication system 10, and the like.
- the memory 11 can also be used to temporarily store data that has been output or is about to be output.
- the processor 12, in some embodiments, may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip for running the program code stored in the memory 11 or processing data, such as executing the identity verification system 10.
- the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like in some embodiments.
- the display 13 is for displaying information processed in the electronic device 1 and a user interface for displaying visualization, such as a business customization interface or the like.
- the components 11-13 of the electronic device 1 communicate with one another via a system bus.
- FIG. 4 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application.
- the identity verification system 10 can be partitioned into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
- the identity verification system 10 can be partitioned into a framing module 101, an extraction module 102, a calculation module 103, and a result determination module 104.
- a module referred to in this application refers to a series of computer program instruction segments capable of performing a specific function, and is more suitable than the program for describing the execution process of the identity verification system 10 in the electronic device 1, wherein:
- the framing module 101 is configured to: after receiving the current voice data of the target user to be authenticated, obtain standard voice data corresponding to the identity to be verified from the database, and perform framing processing on the current voice data and the standard voice data respectively according to preset framing parameters to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
- the identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains, according to the identity claimed by the target user (the identity to be verified), the standard voice data corresponding to that identity from the database, and then performs framing processing on the received current voice data and the obtained standard voice data according to preset framing parameters, respectively, to obtain the current voice frame group corresponding to the current voice data (comprising a plurality of voice frames obtained by dividing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising a plurality of voice frames obtained by dividing the standard voice data).
- the preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds.
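As an illustration, the 25 ms frame / 10 ms shift scheme above can be sketched in plain Python. The 16 kHz sample rate is an assumption made for the example and is not stated in this passage:

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a sample list into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000       # 160 samples at 16 kHz
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# one second of (dummy) audio at 16 kHz yields 98 overlapping frames of 400 samples
frames = frame_signal([0.0] * 16000)
```

Each frame then becomes one element of the voice frame group from which acoustic features are extracted.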
- the extracting module 102 is configured to separately extract, by using a preset filter, a preset type acoustic feature of each voice frame in the current voice frame group and a preset type acoustic feature of each voice frame in the standard voice frame group;
- after obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in both groups using a preset filter, extracting the preset type acoustic feature of each voice frame in the current voice frame group and in the standard voice frame group.
- the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel-Frequency Cepstral Coefficient) spectral feature.
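A Mel filterbank places its filters uniformly on the mel scale rather than in hertz. A minimal sketch of the standard hz-to-mel conversion follows; the 8 kHz upper edge and the 10-filter count are illustrative assumptions, not values from this passage:

```python
import math

def hz_to_mel(f):
    # O'Shaughnessy formula used by common Mel filterbank implementations
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# center frequencies for a small illustrative filterbank between 0 and 8 kHz
n_filters = 10
low, high = hz_to_mel(0.0), hz_to_mel(8000.0)
mel_points = [low + i * (high - low) / (n_filters + 1) for i in range(n_filters + 2)]
centers_hz = [mel_to_hz(m) for m in mel_points]
```

A full 36-dimensional MFCC pipeline would additionally apply an FFT, the filterbank, a log, and a DCT per frame.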
- the calculation module 103 is configured to input the preset type acoustic features corresponding to the extracted current voice frame group and the preset type acoustic features corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model, respectively, to obtain a feature vector of a preset length corresponding to each of the current voice data and the standard voice data;
- the result determination module 104 is configured to calculate the cosine similarity of the two obtained feature vectors, and to determine an identity verification result according to the magnitude of the calculated cosine similarity, where the identity verification result is either a verification pass result or a verification failure result.
- the identity verification system has a pre-trained preset structure deep neural network model, obtained by iterative training on the corresponding preset type acoustic features of sample speech data. After extracting the preset type acoustic features of the current voice frame group and the standard voice frame group, the system inputs them into the pre-trained preset structure deep neural network model, and the model converts the preset type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into feature vectors of a preset length (for example, feature vectors of length 1). The system then calculates the cosine similarity of the two feature vectors and determines the identity verification result according to the magnitude of the calculated cosine similarity, that is, it compares the cosine similarity with a preset threshold (for example, 0.95): if the similarity reaches the threshold the verification passes, otherwise it fails.
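The final comparison step can be sketched as follows. The 0.95 threshold mirrors the example in the text; the helper names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(current_vec, standard_vec, threshold=0.95):
    # verification passes when the two embeddings are close enough
    return cosine_similarity(current_vec, standard_vec) >= threshold
```

For L2-normalized embeddings (as produced by the model's normalization layer), the cosine similarity reduces to a plain dot product.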
- in summary, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset type acoustic features of each voice frame obtained by the framing processing are extracted using a preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated, and the verification result is determined according to its magnitude. Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short, and the deep neural network model can then output a verification result from the extracted acoustic features. This scheme therefore makes speaker identity verification more accurate and reliable.
- FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
- the identity verification system further includes:
- the detection module 105 is configured to perform voice activity detection (VAD) on the current voice data and the standard voice data before the framing processing is performed on them according to the preset framing parameters, and to delete the detected non-speaker voice parts (for example, silence or noise) from the current voice data and the standard voice data.
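The passage names VAD but does not specify an algorithm. A simplified energy-based stand-in (the threshold value and frame representation are illustrative assumptions, not the patented method) could look like:

```python
def energy_vad(frames, threshold=1e-4):
    """Keep only frames whose mean energy exceeds the threshold (crude VAD)."""
    voiced = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            voiced.append(frame)
    return voiced

speech = [[0.1, -0.1, 0.2]] * 3    # toy "loud" frames
silence = [[0.0, 0.0, 0.0]] * 2    # toy silent frames
kept = energy_vad(speech + silence)  # the two silent frames are dropped
```

Production systems typically use statistical or model-based VAD rather than a fixed energy threshold.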
- the training process of the preset structure deep neural network model is as follows (refer to FIG. 2):
- S1, acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
- each voice data sample is voice data of a known speaker identity; among the voice data samples, each speaker identity (or part of the speaker identities) corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of the corresponding speaker.
- voice activity detection is performed on each voice data sample to delete the non-speaker voice parts (e.g., silence or noise), obtaining standard voice data samples.
- a first percentage of the obtained standard voice data samples is used as a training set, and a second percentage is used as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
- for example, 70% of the obtained standard voice data samples are used as the training set and 30% as the verification set.
- the preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds;
- the preset filter is, for example, a Mel filter, and the preset type acoustic feature extracted by the Mel filter is an MFCC (Mel-Frequency Cepstral Coefficient) spectral feature, for example, a 36-dimensional MFCC spectral feature.
- the preset type acoustic features in the training set are processed in batches, divided into M (for example, 30) batches. The batches can be assigned by voice frame group, with an equal or unequal number of voice frame groups (and their corresponding preset type acoustic features) allocated to each batch. The preset type acoustic features corresponding to each voice frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training; each batch of preset type acoustic features makes the preset structure deep neural network model iterate once, and each iteration updates the model to new parameters. After multiple iterations of training, the preset structure deep neural network model is updated to better model parameters.
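The batch assignment by voice frame group described above might be sketched as follows. The round-robin policy is one possible choice; the text allows equal or unequal batch sizes:

```python
def make_batches(frame_groups, n_batches):
    """Assign voice frame groups (or their features) to n_batches round-robin."""
    batches = [[] for _ in range(n_batches)]
    for i, group in enumerate(frame_groups):
        batches[i % n_batches].append(group)
    return batches

groups = list(range(10))          # stand-ins for 10 voice frame groups
batches = make_batches(groups, 3)  # 3 batches of sizes 4, 3, 3
```

Each batch would then drive one iteration (one parameter update) of the model.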
- the accuracy of the preset structure deep neural network model is then verified using the verification set: the standard voice data samples in the verification set are paired into groups of two, the standard voice data samples of each group are input into the model in turn, and the accuracy rate is calculated from the number of groups with correct verification results. For example, if 100 groups are verified and 99 groups yield correct verification results, the accuracy rate is 99%.
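The accuracy computation in the worked example (99 correct results out of 100 verified groups gives 99%) is simply:

```python
def accuracy(results):
    """results: list of booleans, True when a verification-set group is judged correctly."""
    return sum(results) / len(results)

rate = accuracy([True] * 99 + [False])  # 99 correct groups out of 100
```

This rate is what gets compared against the preset accuracy threshold (e.g., 98.5%) below.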
- a verification threshold for the accuracy rate (i.e., the preset threshold, for example, 98.5%) is preset in the system for verifying the training effect of the preset structure deep neural network model;
- if the verified accuracy of the preset structure deep neural network model is greater than the preset threshold, the training of the preset structure deep neural network model has reached the standard, and the model training ends.
- otherwise, steps S1-S5 are re-executed in a loop until the requirement of step S6 is met, and the model training ends.
- the process of the iterative training of the preset structure deep neural network model includes:
- the i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
- α is a constant with a value ranging from 0.05 to 0.2;
- N is the number of triplets obtained.
- the model parameter updating step is: 1. use the back propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use the mini-batch SGD (mini-batch stochastic gradient descent) method to update the parameters of the preset structure deep neural network.
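Combining the triplet definition and the cosine similarities described here, a plain-Python sketch of the forward pass of the triplet loss could be as follows. The exact combination is an assumption reconstructed from the surrounding description, with α = 0.1 chosen as a mid-range example of the stated 0.05-0.2 interval:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_cosine_loss(triplets, alpha=0.1):
    """Sum over triplets of max(0, cos(anchor, negative) - cos(anchor, positive) + alpha).

    Each triplet is (x_i1, x_i2, x_i3): x_i1/x_i2 same speaker, x_i1/x_i3 different speakers.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        total += max(0.0, cos_sim(anchor, negative) - cos_sim(anchor, positive) + alpha)
    return total
```

Gradients of this loss would then be back-propagated and the parameters updated with mini-batch SGD, as the step above states.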
- the network structure of the preset structure deep neural network model of this embodiment is as follows:
- the first layer is a stack of recurrent layers with the same structure, in which each layer uses a forward long short-term memory network (LSTM) and a backward LSTM; the number of layers is 1-3. The forward LSTM and the backward LSTM each output a vector sequence;
- the second layer is the average layer; its function is to average each vector sequence along the time axis. The vector sequences output by the forward LSTM and the backward LSTM of the previous layer are averaged to obtain a forward average vector and a backward average vector, and the two average vectors are concatenated into one vector;
- the third layer is a deep neural network (DNN) fully connected layer;
- the fourth layer is a normalization layer, which normalizes the output of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
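The average layer and the normalization layer have simple closed forms, sketched below with toy 2-step, 2-dimensional LSTM outputs (the numbers are illustrative, not from the text):

```python
import math

def average_layer(forward_seq, backward_seq):
    """Average each LSTM output sequence over time, then concatenate the two means."""
    def time_mean(seq):
        n = len(seq)
        return [sum(step[d] for step in seq) / n for d in range(len(seq[0]))]
    return time_mean(forward_seq) + time_mean(backward_seq)

def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

fwd = [[1.0, 3.0], [3.0, 1.0]]   # toy forward LSTM outputs: 2 time steps, dim 2
bwd = [[0.0, 2.0], [4.0, 2.0]]   # toy backward LSTM outputs
embedding = l2_normalize(average_layer(fwd, bwd))  # unit-length speaker embedding
```

In the full model, the DNN fully connected layer would sit between these two operations.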
- the fifth layer is the loss layer. The formula of the loss function L is L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant with a value ranging from 0.05 to 0.2, cos(x_i1, x_i2) represents the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) represents the cosine similarity of two feature vectors that do not belong to the same speaker.
- the present application further provides a computer readable storage medium storing an identity verification system executable by at least one processor, to cause the at least one processor to execute the identity verification method in any of the above embodiments.
Abstract
An electronic device, an identity verification method and a storage medium are disclosed, the method comprising: upon receiving current voice data of a target user whose identity is to be verified, acquiring standard voice data corresponding to the identity to be verified, and performing framing processing on the two pieces of voice data to obtain a current voice frame group and a standard voice frame group (S10); using a preset filter to respectively extract a preset type of acoustic feature from each voice frame of the two voice frame groups (S20); respectively inputting the extracted preset type of acoustic feature into a pre-trained deep neural network model of a preset structure, to obtain a feature vector of a preset length corresponding respectively to the current voice data and the standard voice data (S30); and calculating a cosine similarity between the two feature vectors and determining an identity verification result according to the magnitude of the calculated cosine similarity (S40). The accuracy of speaker identity verification can thereby be improved.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810225887.2A CN108564955B (zh) | 2018-03-19 | 2018-03-19 | 电子装置、身份验证方法和计算机可读存储介质 |
CN201810225887.2 | 2018-03-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019179029A1 true WO2019179029A1 (fr) | 2019-09-26 |
Family
ID=63532742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/102105 WO2019179029A1 (fr) | 2018-03-19 | 2018-08-24 | Dispositif électronique, procédé de vérification d'identité et support d'informations lisible par ordinateur |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108564955B (fr) |
WO (1) | WO2019179029A1 (fr) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112712792A (zh) * | 2019-10-25 | 2021-04-27 | Tcl集团股份有限公司 | 一种方言识别模型的训练方法、可读存储介质及终端设备 |
CN114648978A (zh) * | 2022-04-27 | 2022-06-21 | 腾讯科技(深圳)有限公司 | 一种语音验证处理的方法以及相关装置 |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108564954B (zh) * | 2018-03-19 | 2020-01-10 | 平安科技(深圳)有限公司 | 深度神经网络模型、电子装置、身份验证方法和存储介质 |
CN110164452B (zh) * | 2018-10-10 | 2023-03-10 | 腾讯科技(深圳)有限公司 | 一种声纹识别的方法、模型训练的方法以及服务器 |
CN109346086A (zh) * | 2018-10-26 | 2019-02-15 | 平安科技(深圳)有限公司 | 声纹识别方法、装置、计算机设备和计算机可读存储介质 |
US10887317B2 (en) * | 2018-11-28 | 2021-01-05 | Sap Se | Progressive authentication security adapter |
CN110148402A (zh) * | 2019-05-07 | 2019-08-20 | 平安科技(深圳)有限公司 | 语音处理方法、装置、计算机设备及存储介质 |
CN110570871A (zh) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | 一种基于TristouNet的声纹识别方法、装置及设备 |
CN111933153B (zh) * | 2020-07-07 | 2024-03-08 | 北京捷通华声科技股份有限公司 | 一种语音分割点的确定方法和装置 |
CN112016673A (zh) * | 2020-07-24 | 2020-12-01 | 浙江工业大学 | 一种基于优化lstm的移动设备用户认证方法及装置 |
CN112309365B (zh) * | 2020-10-21 | 2024-05-10 | 北京大米科技有限公司 | 语音合成模型的训练方法、装置、存储介质以及电子设备 |
CN112347788A (zh) * | 2020-11-06 | 2021-02-09 | 平安消费金融有限公司 | 语料处理方法、装置及存储介质 |
CN113178197B (zh) * | 2021-04-27 | 2024-01-09 | 平安科技(深圳)有限公司 | 语音验证模型的训练方法、装置以及计算机设备 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060025995A1 (en) * | 2004-07-29 | 2006-02-02 | Erhart George W | Method and apparatus for natural language call routing using confidence scores |
CN105139857A (zh) * | 2015-09-02 | 2015-12-09 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | 一种自动说话人识别中针对语音欺骗的对抗方法 |
CN106205624A (zh) * | 2016-07-15 | 2016-12-07 | 河海大学 | 一种基于dbscan算法的声纹识别方法 |
CN106782564A (zh) * | 2016-11-18 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | 用于处理语音数据的方法和装置 |
CN107610707A (zh) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | 一种声纹识别方法及装置 |
2018
- 2018-03-19 CN CN201810225887.2A patent/CN108564955B/zh active Active
- 2018-08-24 WO PCT/CN2018/102105 patent/WO2019179029A1/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108564955A (zh) | 2018-09-21 |
CN108564955B (zh) | 2019-09-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019179029A1 (fr) | Dispositif électronique, procédé de vérification d'identité et support d'informations lisible par ordinateur | |
WO2019179036A1 (fr) | Modèle de réseau neuronal profond, dispositif électronique, procédé d'authentification d'identité et support de stockage | |
JP6429945B2 (ja) | 音声データを処理するための方法及び装置 | |
US6490560B1 (en) | Method and system for non-intrusive speaker verification using behavior models | |
WO2019119505A1 (fr) | Procédé et dispositif de reconnaissance faciale, dispositif informatique et support d'enregistrement | |
US7689418B2 (en) | Method and system for non-intrusive speaker verification using behavior models | |
WO2018166187A1 (fr) | Serveur, procédé et système de vérification d'identité, et support d'informations lisible par ordinateur | |
US10650379B2 (en) | Method and system for validating personalized account identifiers using biometric authentication and self-learning algorithms | |
US11482050B2 (en) | Intelligent gallery management for biometrics | |
CN108989349B (zh) | 用户账号解锁方法、装置、计算机设备及存储介质 | |
US11062120B2 (en) | High speed reference point independent database filtering for fingerprint identification | |
WO2019136911A1 (fr) | Procédé et appareil de reconnaissance vocale, dispositif terminal et support de stockage | |
US12069047B2 (en) | Using an enrolled biometric dataset to detect adversarial examples in biometrics-based authentication system | |
WO2023134232A1 (fr) | Procédé, appareil et dispositif de mise à jour d'une base de données de vecteurs de caractéristiques, et support | |
EP1470549A1 (fr) | Procede et dispositif de verification discrete des locuteurs au moyen de modeles comportementaux | |
CN116561737A (zh) | 基于用户行为基线的密码有效性检测方法及其相关设备 | |
US12014141B2 (en) | Systems and methods for improved transaction categorization using natural language processing | |
WO2023078115A1 (fr) | Procédé de vérification d'informations, serveur, et support de stockage | |
CN117373082A (zh) | 一种人脸识别方法、装置、设备及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18910797 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.01.2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18910797 Country of ref document: EP Kind code of ref document: A1 |