WO2019179029A1 - Electronic device, identity verification method and computer-readable storage medium - Google Patents

Electronic device, identity verification method and computer-readable storage medium

Info

Publication number
WO2019179029A1
WO2019179029A1 (PCT/CN2018/102105)
Authority
WO
WIPO (PCT)
Prior art keywords
layer
voice data
preset
neural network
average
Prior art date
Application number
PCT/CN2018/102105
Other languages
English (en)
Chinese (zh)
Inventor
赵峰
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019179029A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/044 Recurrent networks, e.g. Hopfield networks
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods
                • G06N 3/084 Backpropagation, e.g. using gradient descent
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 17/00 Speaker identification or verification techniques
            • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
            • G10L 17/04 Training, enrolment or model building
            • G10L 17/18 Artificial neural networks; Connectionist approaches
          • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
              • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
            • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
              • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present application relates to the field of voiceprint recognition technologies, and in particular, to an electronic device, an authentication method, and a computer readable storage medium.
  • Speaker recognition, commonly referred to as voiceprint recognition, is a type of biometric technology often used to confirm whether a given segment of speech was spoken by a designated person, a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example in civil safety certification in finance, securities, social security, public security, and the military.
  • Speaker recognition includes text-dependent recognition and text-independent recognition.
  • In recent years, text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past.
  • However, existing text-independent speaker recognition technology remains insufficiently accurate and error-prone.
  • the main purpose of the present application is to provide an electronic device, an authentication method, and a computer readable storage medium, which are intended to improve the accuracy of speaker authentication.
  • an electronic device proposed by the present application includes a memory and a processor, and the memory stores an identity verification system executable on the processor, where the identity verification system is implemented by the processor The following steps:
  • after receiving current voice data of a target user whose identity is to be verified, standard voice data corresponding to the identity to be verified is obtained from the database, and framing processing is performed on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • the processor is further configured to execute the identity verification system to implement the following steps:
  • Voice activity detection (VAD) is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • the training process of the preset structure deep neural network model is:
  • S1: Acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • a first percentage of the acquired voice data samples is used as a training set, and a second percentage is used as a validation set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the process of the iterative training of the preset structure deep neural network model comprises:
  • the i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
  • α is a constant ranging from 0.05 to 0.2;
  • N is the number of triplets obtained.
  • the network structure of the preset structure deep neural network model is as follows:
  • the first layer is a stack of identically structured recurrent layers, where each layer combines a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1 to 3;
  • the second layer is the average layer: it averages the vector sequences output by the preceding forward LSTM and backward LSTM along the time axis to obtain a forward average vector and a backward average vector, and the two average vectors are concatenated into a single vector;
  • the third layer is a deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the output of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer, with loss function L = Σ_{i=1..N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant ranging from 0.05 to 0.2, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • the application also provides an authentication method, which includes:
  • after receiving current voice data of a target user whose identity is to be verified, standard voice data corresponding to the identity to be verified is obtained from the database, and framing processing is performed on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • Before the step of performing the framing processing on the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further includes the following step:
  • Voice activity detection (VAD) is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • the training process of the preset structure deep neural network model is:
  • S1: Acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • a first percentage of the acquired voice data samples is used as a training set, and a second percentage is used as a validation set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the network structure of the preset structure deep neural network model is as follows:
  • the first layer is a stack of identically structured recurrent layers, where each layer combines a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1 to 3;
  • the second layer is the average layer: it averages the vector sequences output by the preceding forward LSTM and backward LSTM along the time axis to obtain a forward average vector and a backward average vector, and the two average vectors are concatenated into a single vector;
  • the third layer is a deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the output of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer, with loss function L = Σ_{i=1..N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant ranging from 0.05 to 0.2, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • the application further provides a computer readable storage medium storing an identity verification system executable by at least one processor to cause the at least one processor to perform the following steps:
  • after receiving current voice data of a target user whose identity is to be verified, standard voice data corresponding to the identity to be verified is obtained from the database, and framing processing is performed on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • In the technical solution of the present application, framing processing is first performed on the received current voice data of the target user whose identity is to be verified and on the standard voice data of the identity to be verified, and a preset filter is used to extract the preset type acoustic features of each voice frame obtained by the framing.
  • The extracted preset type acoustic features are then input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors.
  • The cosine similarity of the two feature vectors is calculated, and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the pre-trained deep neural network model then processes the extracted acoustic features to output the verification result.
  • This makes the scheme more accurate and reliable for speaker identity verification.
  • FIG. 1 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
  • FIG. 2 is a schematic flowchart of the training process of the preset structure deep neural network model of the present application.
  • FIG. 3 is a schematic diagram of an operating environment of an embodiment of an identity verification system according to the present application.
  • FIG. 4 is a block diagram of a program of an embodiment of an identity verification system of the present application.
  • FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
  • FIG. 1 is a schematic flowchart of an embodiment of the identity verification method according to the present application.
  • the identity verification method includes:
  • Step S10: After receiving the current voice data of the target user to be authenticated, obtain the standard voice data corresponding to the identity to be verified from the database, and perform framing processing on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data.
  • The identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the system obtains from the database, according to the identity the target user claims (the identity to be verified), the corresponding standard voice data, and then performs framing processing on the received current voice data and the obtained standard voice data respectively according to preset framing parameters. This yields the current voice frame group (the plurality of voice frames into which the current voice data is divided) and the standard voice frame group (the plurality of voice frames into which the standard voice data is divided).
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds (see the framing sketch below).
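  • As a minimal illustration, the 25 ms / 10 ms framing described above might be implemented as follows (Python sketch; the 16 kHz sample rate and the function name are assumptions, not stated in the patent):

      import numpy as np

      def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
          # Split a 1-D waveform into overlapping voice frames:
          # 25 ms frame length, 10 ms frame shift (the preset framing parameters).
          frame_len = int(sample_rate * frame_ms / 1000)
          frame_shift = int(sample_rate * shift_ms / 1000)
          num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
          return np.stack([signal[i * frame_shift:i * frame_shift + frame_len]
                           for i in range(num_frames)])

      # Example: 2 seconds of 16 kHz audio yields 198 frames of 400 samples each.
      frames = frame_signal(np.random.randn(32000), 16000)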
  • Step S20: Using a preset filter, extract the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group.
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in both groups using the preset filter, obtaining the preset type acoustic features of each frame.
  • In this embodiment, the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) feature (see the extraction sketch below).
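  • A hedged sketch of the 36-dimensional MFCC extraction with a Mel filter bank, using the librosa library (the file name and the 16 kHz sample rate are assumptions):

      import librosa

      def extract_mfcc(path, n_mfcc=36, frame_ms=25, shift_ms=10):
          # Extract 36-dimensional MFCC features, one vector per voice frame.
          y, sr = librosa.load(path, sr=16000)          # mono, 16 kHz (assumed)
          mfcc = librosa.feature.mfcc(
              y=y, sr=sr, n_mfcc=n_mfcc,
              n_fft=int(sr * frame_ms / 1000),          # 25 ms analysis window
              hop_length=int(sr * shift_ms / 1000))     # 10 ms frame shift
          return mfcc.T                                 # shape: (num_frames, 36)

      current_features = extract_mfcc("current_voice.wav")  # hypothetical file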
  • Step S30: Input the preset type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model, to obtain a feature vector of preset length for each of the current voice data and the standard voice data.
  • Step S40: Calculate the cosine similarity of the two feature vectors, and determine an identity verification result according to the calculated cosine similarity, where the identity verification result is either a verification pass result or a verification failure result.
  • The authentication system holds a pre-trained preset structure deep neural network model, obtained by iterative training on the preset type acoustic features of sample voice data. After extracting the preset type acoustic features corresponding to the current voice frame group and to the standard voice frame group, the system inputs them into the model, which converts each set of features into a feature vector of preset length (for example, a feature vector of length 1). The system then calculates the cosine similarity of the two feature vectors and determines the identity verification result by comparing the cosine similarity with a preset threshold (for example, 0.95): if the similarity reaches the threshold, verification passes; otherwise, it fails (see the comparison sketch below).
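  • The final comparison step can be sketched as follows (the 0.95 threshold is the example value above; the function names are illustrative). Because the model's normalization layer outputs unit-length vectors, the cosine similarity here reduces to a plain dot product:

      import numpy as np

      def cosine_similarity(a, b):
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      def verify(current_vec, standard_vec, threshold=0.95):
          # Verification passes when the cosine similarity of the two
          # feature vectors reaches the preset threshold.
          return cosine_similarity(current_vec, standard_vec) >= threshold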
  • In this way, framing processing is first performed on the received current voice data of the target user whose identity is to be verified and on the standard voice data of that identity, and a preset filter is used to extract the preset type acoustic features of each voice frame obtained by the framing.
  • The extracted preset type acoustic features are then input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated, and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the pre-trained deep neural network model then processes the extracted acoustic features to output the verification result.
  • This makes the scheme more accurate and reliable for speaker identity verification.
  • Further, before the step of performing the framing processing on the current voice data and the standard voice data according to the preset framing parameters, the method further includes the following step:
  • performing voice activity detection (VAD) on the current voice data and the standard voice data, respectively, and deleting the non-speaker voice parts (for example, silence or noise) from the current voice data and the standard voice data (a crude stand-in sketch follows below).
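  • The patent does not name a specific VAD algorithm; as a rough stand-in, an energy-based filter over already-framed audio might look like this (the percentile threshold is an assumption; production systems would use a dedicated detector):

      import numpy as np

      def remove_non_speech(frames, energy_percentile=30):
          # Drop voice frames whose short-time energy falls below a percentile,
          # a rough proxy for deleting silence and noise.
          energies = (frames ** 2).sum(axis=1)
          threshold = np.percentile(energies, energy_percentile)
          return frames[energies > threshold]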
  • the training process of the preset structure deep neural network model is:
  • S1: Acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • In this embodiment, each voice data sample is voice data of a known speaker identity; among the acquired voice data samples, each speaker identity (or at least some of the speaker identities) corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of its speaker.
  • Non-speaker voices (for example, silence or noise) in the samples are likewise deleted.
  • A first percentage of the acquired voice data samples is used as a training set, and a second percentage is used as a validation set, where the sum of the first percentage and the second percentage is less than or equal to 100%.
  • For example, 70% of the acquired voice data samples are used as the training set and 30% as the validation set (see the split sketch below).
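  • A minimal sketch of this 70% / 30% split (the sample representation is an assumption):

      import random

      def split_samples(samples, train_pct=0.7):
          # Shuffle the labeled voice data samples and split them into a
          # training set (70%) and a validation set (the remaining 30%).
          random.shuffle(samples)
          cut = int(len(samples) * train_pct)
          return samples[:cut], samples[cut:]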
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds.
  • The preset filter is, for example, a Mel filter, and the preset type acoustic feature extracted by the Mel filter is an MFCC (Mel Frequency Cepstral Coefficient) feature, for example a 36-dimensional MFCC feature.
  • The validation set is used to verify the accuracy of the preset structure deep neural network model.
  • The preset type acoustic features in the training set are processed in batches, being divided into M (for example, 30) batches; batches can be assigned by voice frame group, with an equal or unequal number of voice frame groups (and their corresponding preset type acoustic features) allocated to each batch (a batching sketch follows below).
  • The preset type acoustic features corresponding to each voice frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training; each batch of preset type acoustic features drives one iteration of the model, and each iteration updates the model parameters. After multiple iterations of training, the preset structure deep neural network model is updated to better model parameters.
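  • A hedged sketch of the batching step (grouping is by voice frame group, as described above; batch sizes may be unequal):

      import random

      def make_batches(frame_groups, num_batches=30):
          # Divide the training set's voice frame groups (with their
          # preset type acoustic features) into at most M batches, e.g. M = 30.
          random.shuffle(frame_groups)
          size = (len(frame_groups) + num_batches - 1) // num_batches
          return [frame_groups[i:i + size]
                  for i in range(0, len(frame_groups), size)]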
  • The accuracy of the preset structure deep neural network model is then verified using the validation set: the voice data samples in the validation set are paired into groups of two, and the pair of samples in each group is input into the model to obtain a verification result for each group.
  • The accuracy rate is calculated from the number of correct verification results; for example, if 100 groups are verified and 99 of them yield correct results, the accuracy rate is 99% (see the accuracy sketch below).
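  • A sketch of this accuracy check (the pairing scheme and the model interface are assumptions; cosine_similarity is defined in the earlier sketch):

      def validation_accuracy(pairs, model, threshold=0.95):
          # pairs: list of (features_a, features_b, same_speaker) groups.
          # Returns the fraction of groups whose verification result is correct,
          # e.g. 99 correct results out of 100 groups gives 99% accuracy.
          correct = 0
          for feats_a, feats_b, same_speaker in pairs:
              vec_a, vec_b = model(feats_a), model(feats_b)
              predicted_same = cosine_similarity(vec_a, vec_b) >= threshold
              correct += int(predicted_same == same_speaker)
          return correct / len(pairs)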
  • A verification threshold for the accuracy rate (that is, a preset threshold, for example 98.5%) is preset in the system to check the training effect of the preset structure deep neural network model.
  • If the accuracy rate obtained when verifying the preset structure deep neural network model is greater than the preset threshold, the training of the model has reached the standard, and model training ends.
  • Otherwise, steps S1-S5 are re-executed, and the loop continues until the requirement of step S6 is met, at which point model training ends.
  • the process of the iterative training of the preset structure deep neural network model includes:
  • The i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer (a sampling sketch follows after the update step below);
  • α is a constant ranging from 0.05 to 0.2;
  • N is the number of triplets obtained.
  • The model parameter updating step is: 1. use the back-propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
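  • To make the triplet construction concrete, a hedged sampling sketch follows (the grouping of feature vectors by speaker and the sampling strategy are assumptions; each sampled speaker needs at least two vectors):

      import random

      def build_triplets(vectors_by_speaker, n_triplets):
          # Form triplets (x_i1, x_i2, x_i3): x_i1 and x_i2 come from the
          # same speaker, x_i3 from a different speaker.
          speakers = list(vectors_by_speaker)
          triplets = []
          for _ in range(n_triplets):
              same, other = random.sample(speakers, 2)
              x1, x2 = random.sample(vectors_by_speaker[same], 2)
              x3 = random.choice(vectors_by_speaker[other])
              triplets.append((x1, x2, x3))
          return triplets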
  • the network structure of the preset structure deep neural network model of this embodiment is as follows:
  • The first layer is a stack of identically structured recurrent layers, where each layer combines a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence.
  • The second layer is the average layer: it averages the vector sequences output by the preceding forward LSTM and backward LSTM along the time axis to obtain a forward average vector and a backward average vector, and concatenates the two average vectors into a single vector.
  • The third layer is a deep neural network (DNN) fully connected layer.
  • The fourth layer is a normalization layer, which normalizes the output of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1.
  • The fifth layer is the loss layer, with loss function L = Σ_{i=1..N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant ranging from 0.05 to 0.2, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker (a combined model-and-loss sketch follows below).
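  • Putting the five layers and the loss together, a hedged PyTorch sketch follows; the layer sizes, the number of LSTM layers, and the learning rate are assumptions not stated in the patent:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SpeakerEmbeddingNet(nn.Module):
          # Sketch of the five-layer structure described above.
          def __init__(self, feat_dim=36, hidden=128, emb_dim=64, lstm_layers=2):
              super().__init__()
              # Layer 1: stacked forward + backward LSTM (1-3 layers; 2 assumed).
              self.blstm = nn.LSTM(feat_dim, hidden, num_layers=lstm_layers,
                                   batch_first=True, bidirectional=True)
              # Layer 3: DNN fully connected layer.
              self.fc = nn.Linear(2 * hidden, emb_dim)

          def forward(self, x):              # x: (batch, frames, 36 MFCC dims)
              seq, _ = self.blstm(x)
              # Layer 2: average along the time axis; the bidirectional output
              # already holds forward and backward vectors in series, so its
              # time mean equals the two average vectors concatenated.
              avg = seq.mean(dim=1)
              # Layer 4: L2 normalization -> feature vector of length 1.
              return F.normalize(self.fc(avg), p=2, dim=1)

      def triplet_cosine_loss(anchor, positive, negative, alpha=0.1):
          # Layer 5: L = sum_i max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + alpha).
          sim_same = F.cosine_similarity(anchor, positive)   # same speaker
          sim_diff = F.cosine_similarity(anchor, negative)   # different speakers
          return torch.clamp(sim_diff - sim_same + alpha, min=0).sum()

      # One update step: back propagation computes the gradients,
      # mini-batch SGD applies them to the model parameters.
      model = SpeakerEmbeddingNet()
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
      a, p, n = (torch.randn(8, 200, 36) for _ in range(3))  # toy mini-batch
      loss = triplet_cosine_loss(model(a), model(p), model(n))
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()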
  • the present application also proposes an identity verification system.
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of the identity verification system 10 of the present application.
  • the identity verification system 10 is installed and operates in the electronic device 1.
  • The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a server.
  • the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • Figure 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk or memory of the electronic device 1.
  • The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • The memory 11 is used to store application software installed in the electronic device 1 and various types of data, such as the program code of the identity verification system 10.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • The processor 12, in some embodiments, may be a Central Processing Unit (CPU), microprocessor, or other data processing chip for running program code or processing data stored in the memory 11, for example executing the identity verification system 10.
  • The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like in some embodiments.
  • The display 13 is used to display information processed in the electronic device 1 and to display a visualized user interface, such as a business customization interface.
  • the components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • FIG. 4 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 may be partitioned into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
  • the authentication system 10 can be partitioned into a component frame module 101, an extraction module 102, a calculation module 103, and a result determination module 104.
  • A module referred to in this application is a series of computer program instruction segments capable of performing a specific function, better suited than a whole program to describing the execution process of the identity verification system 10 in the electronic device 1, wherein:
  • The framing module 101 is configured to: after receiving the current voice data of the target user to be authenticated, obtain the standard voice data corresponding to the identity to be verified from the database, and perform framing processing on the current voice data and the standard voice data respectively according to the preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the system obtains from the database, according to the identity the target user claims (the identity to be verified), the corresponding standard voice data, and then performs framing processing on the received current voice data and the obtained standard voice data respectively according to preset framing parameters. This yields the current voice frame group (the plurality of voice frames into which the current voice data is divided) and the standard voice frame group (the plurality of voice frames into which the standard voice data is divided).
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds.
  • the extracting module 102 is configured to separately extract, by using a preset filter, a preset type acoustic feature of each voice frame in the current voice frame group and a preset type acoustic feature of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in both groups using the preset filter, obtaining the preset type acoustic features of each frame.
  • In this embodiment, the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) feature.
  • The calculating module 103 is configured to input the preset type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model, respectively, to obtain a feature vector of preset length for each of the current voice data and the standard voice data.
  • The result determining module 104 is configured to calculate the cosine similarity of the two obtained feature vectors, and to determine an identity verification result according to the calculated cosine similarity, where the identity verification result is either a verification pass result or a verification failure result.
  • The authentication system holds a pre-trained preset structure deep neural network model, obtained by iterative training on the preset type acoustic features of sample voice data. After extracting the preset type acoustic features corresponding to the current voice frame group and to the standard voice frame group, the system inputs them into the model, which converts each set of features into a feature vector of preset length (for example, a feature vector of length 1). The system then calculates the cosine similarity of the two feature vectors and determines the identity verification result by comparing the cosine similarity with a preset threshold (for example, 0.95): if the similarity reaches the threshold, verification passes; otherwise, it fails.
  • In this way, framing processing is first performed on the received current voice data of the target user whose identity is to be verified and on the standard voice data of that identity, and a preset filter is used to extract the preset type acoustic features of each voice frame obtained by the framing.
  • The extracted preset type acoustic features are then input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated, and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the pre-trained deep neural network model then processes the extracted acoustic features to output the verification result.
  • This makes the scheme more accurate and reliable for speaker identity verification.
  • FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
  • the identity verification system further includes:
  • The detecting module 105 is configured to perform voice activity detection (VAD) on the current voice data and the standard voice data before the framing processing is performed on them according to the preset framing parameters, and to delete the non-speaker voice parts (for example, silence or noise) from the current voice data and the standard voice data.
  • the training process of the preset structure deep neural network model is (refer to FIG. 2):
  • S1: Acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • In this embodiment, each voice data sample is voice data of a known speaker identity; among the acquired voice data samples, each speaker identity (or at least some of the speaker identities) corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of its speaker.
  • Non-speaker voices (for example, silence or noise) in the samples are likewise deleted.
  • A first percentage of the acquired voice data samples is used as a training set, and a second percentage is used as a validation set, where the sum of the first percentage and the second percentage is less than or equal to 100%.
  • For example, 70% of the acquired voice data samples are used as the training set and 30% as the validation set.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds.
  • The preset filter is, for example, a Mel filter, and the preset type acoustic feature extracted by the Mel filter is an MFCC (Mel Frequency Cepstral Coefficient) feature, for example a 36-dimensional MFCC feature.
  • The preset type acoustic features in the training set are processed in batches, being divided into M (for example, 30) batches; batches can be assigned by voice frame group, with an equal or unequal number of voice frame groups (and their corresponding preset type acoustic features) allocated to each batch.
  • The preset type acoustic features corresponding to each voice frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training; each batch of preset type acoustic features drives one iteration of the model, and each iteration updates the model parameters. After multiple iterations of training, the preset structure deep neural network model is updated to better model parameters.
  • The accuracy of the preset structure deep neural network model is then verified using the validation set: the voice data samples in the validation set are paired into groups of two, and the pair of samples in each group is input into the model to obtain a verification result for each group.
  • The accuracy rate is calculated from the number of correct verification results; for example, if 100 groups are verified and 99 of them yield correct results, the accuracy rate is 99%.
  • A verification threshold for the accuracy rate (that is, a preset threshold, for example 98.5%) is preset in the system to check the training effect of the preset structure deep neural network model.
  • If the accuracy rate obtained when verifying the preset structure deep neural network model is greater than the preset threshold, the training of the model has reached the standard, and model training ends.
  • Otherwise, steps S1-S5 are re-executed, and the loop continues until the requirement of step S6 is met, at which point model training ends.
  • the process of the iterative training of the preset structure deep neural network model includes:
  • The i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
  • α is a constant ranging from 0.05 to 0.2;
  • N is the number of triplets obtained.
  • The model parameter updating step is: 1. use the back-propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
  • the network structure of the preset structure deep neural network model of this embodiment is as follows:
  • The first layer is a stack of identically structured recurrent layers, where each layer combines a forward long short-term memory (LSTM) network and a backward LSTM, and the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence.
  • The second layer is the average layer: it averages the vector sequences output by the preceding forward LSTM and backward LSTM along the time axis to obtain a forward average vector and a backward average vector, and concatenates the two average vectors into a single vector.
  • The third layer is a deep neural network (DNN) fully connected layer.
  • The fourth layer is a normalization layer, which normalizes the output of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1.
  • The fifth layer is the loss layer, with loss function L = Σ_{i=1..N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant ranging from 0.05 to 0.2, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • In addition, the present application further provides a computer readable storage medium storing an identity verification system executable by at least one processor, to cause the at least one processor to execute the identity verification method in any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are an electronic device, an identity verification method, and a storage medium, the method comprising: upon receiving current voice data of a target user whose identity is to be verified, acquiring standard voice data corresponding to the identity to be verified and performing framing processing on the two pieces of voice data to obtain a current voice frame group and a standard voice frame group (S10); using a preset filter to respectively extract a preset type of acoustic feature of each voice frame in the two voice frame groups (S20); respectively inputting the extracted preset type of acoustic features into a pre-trained deep neural network model of a preset structure, to obtain feature vectors of a preset length corresponding respectively to the current voice data and the standard voice data (S30); and calculating the cosine similarity between the two feature vectors and determining an identity verification result according to the magnitude of the calculated cosine similarity (S40). The accuracy of speaker identity verification can thereby be improved.
PCT/CN2018/102105 2018-03-19 2018-08-24 Electronic device, identity verification method and computer-readable storage medium WO2019179029A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810225887.2A CN108564955B (zh) 2018-03-19 2018-03-19 Electronic device, identity verification method and computer-readable storage medium
CN201810225887.2 2018-03-19

Publications (1)

Publication Number Publication Date
WO2019179029A1 true WO2019179029A1 (fr) 2019-09-26

Family

ID=63532742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102105 WO2019179029A1 (fr) 2018-03-19 2018-08-24 Electronic device, identity verification method and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN108564955B (fr)
WO (1) WO2019179029A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712792A (zh) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Training method for a dialect recognition model, readable storage medium and terminal device
CN114648978A (zh) * 2022-04-27 2022-06-21 腾讯科技(深圳)有限公司 Voice verification processing method and related apparatus

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564954B (zh) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity verification method and storage medium
CN110164452B (zh) * 2018-10-10 2023-03-10 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN109346086A (zh) * 2018-10-26 2019-02-15 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, computer device and computer-readable storage medium
US10887317B2 (en) * 2018-11-28 2021-01-05 Sap Se Progressive authentication security adapter
CN110148402A (zh) * 2019-05-07 2019-08-20 平安科技(深圳)有限公司 Voice processing method and apparatus, computer device and storage medium
CN110570871A (zh) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, apparatus and device
CN111933153B (zh) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 Method and apparatus for determining voice segmentation points
CN112016673A (zh) * 2020-07-24 2020-12-01 浙江工业大学 Mobile device user authentication method and apparatus based on optimized LSTM
CN112309365B (zh) * 2020-10-21 2024-05-10 北京大米科技有限公司 Training method and apparatus for a speech synthesis model, storage medium and electronic device
CN112347788A (zh) * 2020-11-06 2021-02-09 平安消费金融有限公司 Corpus processing method, apparatus and storage medium
CN113178197B (zh) * 2021-04-27 2024-01-09 平安科技(深圳)有限公司 Training method and apparatus for a voice verification model, and computer device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores
CN105139857A (zh) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countermeasure method against voice spoofing in automatic speaker recognition
CN106205624A (zh) * 2016-07-15 2016-12-07 河海大学 Voiceprint recognition method based on the DBSCAN algorithm
CN106782564A (zh) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 Method and apparatus for processing voice data
CN107610707A (zh) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus


Also Published As

Publication number Publication date
CN108564955A (zh) 2018-09-21
CN108564955B (zh) 2019-09-03

Similar Documents

Publication Publication Date Title
WO2019179029A1 (fr) Electronic device, identity verification method and computer-readable storage medium
WO2019179036A1 (fr) Deep neural network model, electronic device, identity authentication method and storage medium
JP6429945B2 (ja) Method and apparatus for processing voice data
US6490560B1 (en) Method and system for non-intrusive speaker verification using behavior models
WO2019119505A1 (fr) Facial recognition method and device, computer device and storage medium
US7689418B2 (en) Method and system for non-intrusive speaker verification using behavior models
WO2018166187A1 (fr) Server, identity verification method and system, and computer-readable storage medium
US10650379B2 (en) Method and system for validating personalized account identifiers using biometric authentication and self-learning algorithms
US11482050B2 (en) Intelligent gallery management for biometrics
CN108989349B (zh) User account unlocking method and apparatus, computer device and storage medium
US11062120B2 (en) High speed reference point independent database filtering for fingerprint identification
WO2019136911A1 (fr) Speech recognition method and apparatus, terminal device and storage medium
US12069047B2 (en) Using an enrolled biometric dataset to detect adversarial examples in biometrics-based authentication system
WO2023134232A1 (fr) Method, apparatus and device for updating a feature vector database, and medium
EP1470549A1 (fr) Method and device for non-intrusive speaker verification using behavioral models
CN116561737A (zh) Password validity detection method based on user behavior baselines, and related device
US12014141B2 (en) Systems and methods for improved transaction categorization using natural language processing
WO2023078115A1 (fr) Information verification method, server and storage medium
CN117373082A (zh) Face recognition method, apparatus, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18910797

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18910797

Country of ref document: EP

Kind code of ref document: A1