WO2019179029A1 - Electronic device, identity verification method and computer-readable storage medium - Google Patents

Electronic device, identity verification method and computer-readable storage medium Download PDF

Info

Publication number
WO2019179029A1
WO2019179029A1 (PCT/CN2018/102105)
Authority
WO
WIPO (PCT)
Prior art keywords
layer
voice data
preset
neural network
average
Prior art date
Application number
PCT/CN2018/102105
Other languages
French (fr)
Chinese (zh)
Inventor
赵峰
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019179029A1 publication Critical patent/WO2019179029A1/en

Classifications

    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/18 Artificial neural networks; connectionist approaches
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique, using neural networks
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of voiceprint recognition technologies, and in particular, to an electronic device, an authentication method, and a computer readable storage medium.
  • Speaker recognition, commonly referred to as voiceprint recognition, is a type of biometric technology often used to confirm whether a given segment of speech was spoken by a designated person; it is a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example finance, securities, social security, public security, the military, and other civil security certification.
  • Speaker recognition includes text-dependent recognition and text-independent recognition.
  • In recent years, text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past.
  • However, in some constrained cases, for example when the collected effective speech of the speaker is short (less than 5 seconds), existing text-independent speaker recognition technology is not accurate and is error-prone.
  • the main purpose of the present application is to provide an electronic device, an authentication method, and a computer readable storage medium, which are intended to improve the accuracy of speaker authentication.
  • An electronic device proposed by the present application includes a memory and a processor, where the memory stores an identity verification system executable on the processor, and the identity verification system, when executed by the processor, implements the following steps:
  • after the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from a database, and the current voice data and the standard voice data are each framed according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • the processor is further configured to execute the identity verification system to implement the following steps:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • the training process of the preset structure deep neural network model is:
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the process of the iterative training of the preset structure deep neural network model comprises:
  • the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
  • the model parameters are updated according to the calculated cosine similarities and a predetermined loss function L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2 and N is the number of triplets obtained.
  • the network structure of the preset structure deep neural network model is as follows:
  • the first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • The present application also provides an identity verification method, which includes:
  • after the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from a database, and the current voice data and the standard voice data are each framed according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • Preferably, before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further includes the step of:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • the training process of the preset structure deep neural network model is:
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • a first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the network structure of the preset structure deep neural network model is as follows:
  • the first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • the application further provides a computer readable storage medium storing an identity verification system executable by at least one processor to cause the at least one processor to perform the following steps:
  • after the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from a database, and the current voice data and the standard voice data are each framed according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • In the technical solution of the present application, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset type acoustic features of each voice frame obtained by the framing are extracted using a preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated; and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result. The scheme therefore makes speaker identity verification more accurate and reliable.
  • FIG. 1 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
  • FIG. 2 is a schematic flowchart of the training process of the preset structure deep neural network model according to the present application.
  • FIG. 3 is a schematic diagram of an operating environment of an embodiment of an identity verification system according to the present application.
  • FIG. 4 is a block diagram of a program of an embodiment of an identity verification system of the present application.
  • FIG. 5 is a block diagram of the program of a second embodiment of the identity verification system of the present application.
  • Referring to FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the identity verification method according to the present application.
  • the identity verification method includes:
  • Step S10: after receiving the current voice data of the target user to be authenticated, obtain the standard voice data corresponding to the identity to be verified from the database, and frame the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains, according to the identity claimed by the target user (the identity to be verified), the standard voice data corresponding to that identity from the database, and then frames the received current voice data and the obtained standard voice data respectively according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising a plurality of voice frames obtained by dividing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising a plurality of voice frames obtained by dividing the standard voice data).
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds.
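  • As an illustrative sketch only (the patent does not prescribe an implementation), framing with a 25 ms window and a 10 ms shift could look like the following Python snippet; the 16 kHz sample rate and the function name are assumptions:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])
```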
  • Step S20: extract, using a preset filter, the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the two groups using the preset filter, to extract the preset type acoustic features of each voice frame in the current voice frame group and in the standard voice frame group.
  • In this embodiment, the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) feature.
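  • For illustration only, since the patent names the Mel filter and 36-dimensional MFCC features but no particular library, the extraction might be sketched with librosa as follows; the file path, sample rate, and window parameters are assumptions:

```python
import librosa

def extract_mfcc(wav_path, sample_rate=16000):
    """Extract 36-dimensional MFCC features with a 25 ms window and a 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sample_rate)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=36,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr))  # 10 ms frame shift
    return mfcc.T                    # shape: (n_frames, 36)
```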
  • Step S30: input the preset type acoustic features corresponding to the extracted current voice frame group and the preset type acoustic features corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model respectively, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data;
  • Step S40: calculate the cosine similarity of the two feature vectors, and determine the identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.
  • The identity verification system holds a pre-trained preset structure deep neural network model, obtained by iterative training on the preset type acoustic features of sample speech data. After extracting the preset type acoustic features corresponding to the current speech frame group and the standard speech frame group, the identity verification system inputs them into the pre-trained model, which converts the preset type acoustic features corresponding to the current speech frame group and those corresponding to the standard speech frame group into feature vectors of a preset length (for example, feature vectors of length 1). The cosine similarity of the two feature vectors is then calculated and compared with a preset threshold (for example, 0.95) to determine the identity verification result: if the cosine similarity reaches the threshold, the verification passes; otherwise, the verification fails.
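  • A minimal sketch of this decision step (the threshold value is illustrative; because the model's normalization layer outputs unit-length vectors, the cosine similarity reduces to a dot product, but the general formula is shown for clarity):

```python
import numpy as np

def verify(current_vec, standard_vec, threshold=0.95):
    """Return True (verification pass) if the cosine similarity exceeds the threshold."""
    cos_sim = np.dot(current_vec, standard_vec) / (
        np.linalg.norm(current_vec) * np.linalg.norm(standard_vec))
    return cos_sim >= threshold
```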
  • The current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are thus first framed; the preset type acoustic features of each voice frame obtained by the framing are extracted using the preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated; and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result. The scheme therefore makes speaker identity verification more accurate and reliable.
  • the method further includes the following steps:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • The non-speaker voice parts (for example, silence or noise) are removed by voice activity detection (VAD).
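  • The patent does not fix a particular VAD algorithm; a simple energy-based sketch operating on the frames produced above (the floor of -35 dB relative to the loudest frame is purely illustrative) might look like:

```python
import numpy as np

def energy_vad(frames, energy_floor_db=-35.0):
    """Keep only frames whose log energy exceeds a floor relative to the loudest frame."""
    energies = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    mask = energies > (energies.max() + energy_floor_db)
    return frames[mask]
```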
  • the training process of the preset structure deep neural network model is:
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • In this embodiment, each voice data sample is voice data of a known speaker identity; among the voice data samples, each speaker identity, or part of the speaker identities, corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of the corresponding speaker.
  • The non-speaker voices (for example, silence or noise) in each voice data sample are deleted by active endpoint detection to obtain the standard voice data samples.
  • A first percentage of the obtained standard voice data samples is used as the training set and a second percentage as the verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%; for example, 70% of the standard voice data samples are used as the training set and 30% as the verification set.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter, and the preset type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstral Coefficient) features, for example 36-dimensional MFCC features.
  • the verification set is used to verify the accuracy of the preset structure deep neural network model
  • The preset type acoustic features in the training set are processed in batches, divided into M (for example, 30) batches. The batches can be allocated by voice frame group, with each batch containing an equal or unequal number of voice frame groups and their corresponding preset type acoustic features. The preset type acoustic features corresponding to each speech frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training: each batch of preset type acoustic features causes the model to iterate once, and each iteration updates the model parameters. After multiple training iterations, the preset structure deep neural network model has been updated to better model parameters.
  • The accuracy of the preset structure deep neural network model is then verified using the verification set: the standard voice data samples in the verification set are paired into groups of two, and the samples of each group are input into the model in turn to obtain a verification result for that group.
  • The accuracy rate is calculated according to the number of correct verification results; for example, if 100 groups are verified and 99 groups yield correct verification results, the accuracy rate is 99%.
  • A verification threshold for the accuracy rate (that is, the preset threshold, for example 98.5%) is preset in the system for verifying the training effect of the preset structure deep neural network model.
  • If the verified accuracy of the preset structure deep neural network model is greater than the preset threshold, the training of the model has reached the standard and the model training ends.
  • Otherwise, the number of acquired voice data samples is increased, and steps S1-S5 are re-executed in a loop until the requirement of step S6 is met and the model training ends.
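  • A sketch of this pairwise verification-accuracy check, assuming a hypothetical `embed` function that maps voice data to a unit-length feature vector and a list of labeled sample pairs:

```python
import numpy as np

def verification_accuracy(pairs, labels, embed, threshold=0.95):
    """pairs: list of (voice_a, voice_b); labels: True if both come from the same speaker."""
    correct = 0
    for (a, b), same_speaker in zip(pairs, labels):
        predicted_same = float(np.dot(embed(a), embed(b))) >= threshold
        correct += (predicted_same == same_speaker)
    return correct / len(pairs)  # e.g. 99 correct out of 100 groups -> 0.99
```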
  • the process of the iterative training of the preset structure deep neural network model includes:
  • the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
  • the model parameters are updated according to the calculated cosine similarities and a predetermined loss function L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2 and N is the number of triplets obtained.
  • The model parameter updating step is: 1. use the back propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
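  • A hedged PyTorch sketch of one such update, using the triplet cosine-margin loss above and plain SGD; the `SpeakerEmbedder` module is the sketch given after the network structure description below, and the margin and learning rate are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, anchor, positive, negative, alpha=0.1):
    """One mini-batch update: backpropagate the triplet cosine-margin loss."""
    optimizer.zero_grad()
    e_a, e_p, e_n = model(anchor), model(positive), model(negative)
    cos_pos = F.cosine_similarity(e_a, e_p)  # same-speaker similarity cos(x_i1, x_i2)
    cos_neg = F.cosine_similarity(e_a, e_n)  # different-speaker similarity cos(x_i1, x_i3)
    loss = torch.clamp(cos_neg - cos_pos + alpha, min=0).sum()
    loss.backward()    # back propagation computes the gradients
    optimizer.step()   # mini-batch SGD updates the parameters
    return loss.item()

# Example: optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed rate
```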
  • the network structure of the preset structure deep neural network model of this embodiment is as follows:
  • The first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
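  • For concreteness, a minimal PyTorch sketch of this five-layer structure; the layer sizes are assumptions, and the loss layer corresponds to the `train_step` triplet loss above, applied only during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbedder(nn.Module):
    """Stacked BiLSTM -> average over time -> DNN fully connected -> L2 normalization."""
    def __init__(self, n_mfcc=36, hidden=256, num_layers=2, embed_dim=128):
        super().__init__()
        # Layer 1: stacked layers, each with a forward and a backward LSTM in parallel
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Layer 3: the DNN fully connected layer
        self.fc = nn.Linear(2 * hidden, embed_dim)

    def forward(self, x):      # x: (batch, n_frames, n_mfcc)
        out, _ = self.lstm(x)  # (batch, n_frames, 2 * hidden)
        # Layer 2: averaging the concatenated forward/backward outputs over time
        # equals concatenating the forward mean and the backward mean
        avg = out.mean(dim=1)  # (batch, 2 * hidden)
        emb = self.fc(avg)
        # Layer 4: L2 normalization to a unit-length feature vector
        return F.normalize(emb, p=2, dim=1)
```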
  • the present application also proposes an identity verification system.
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 is installed in and runs on the electronic device 1.
  • The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a server.
  • the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • FIG. 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk or memory of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 is used to store application software and various types of data installed in the electronic device 1, such as program codes of the authentication system 10, and the like.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip for running program code or processing data stored in the memory 11, for example executing the identity verification system 10.
  • the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like in some embodiments.
  • The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface, such as a business customization interface.
  • the components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • FIG. 4 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 may be partitioned into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
  • For example, the identity verification system 10 may be partitioned into a framing module 101, an extraction module 102, a calculation module 103, and a result determination module 104.
  • A module referred to in this application is a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program for describing the execution process of the identity verification system 10 in the electronic device 1, where:
  • The framing module 101 is configured to: after receiving the current voice data of the target user to be authenticated, obtain the standard voice data corresponding to the identity to be verified from the database, and frame the current voice data and the standard voice data respectively according to the preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains, according to the identity claimed by the target user (the identity to be verified), the standard voice data corresponding to that identity from the database, and then frames the received current voice data and the obtained standard voice data respectively according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising a plurality of voice frames obtained by dividing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising a plurality of voice frames obtained by dividing the standard voice data).
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds.
  • the extracting module 102 is configured to separately extract, by using a preset filter, a preset type acoustic feature of each voice frame in the current voice frame group and a preset type acoustic feature of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the two groups using the preset filter, to extract the preset type acoustic features of each voice frame in the current voice frame group and in the standard voice frame group.
  • In this embodiment, the preset filter is a Mel filter, and the extracted preset type acoustic feature is a 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) feature.
  • the calculating module 103 is configured to input the preset type acoustic features corresponding to the extracted current voice frame group and the preset type acoustic features corresponding to the standard voice frame group into the pre-trained preset structure deep neural network model, respectively, to obtain the a feature vector of a preset length corresponding to each of the current voice data and the standard voice data;
  • the result determining module 104 is configured to calculate a cosine similarity of the obtained two feature vectors, and determine an identity verification result according to the calculated cosine similarity size, where the identity verification result includes a verification pass result and a verification failure result.
  • The identity verification system holds a pre-trained preset structure deep neural network model, obtained by iterative training on the preset type acoustic features of sample speech data. After extracting the preset type acoustic features corresponding to the current speech frame group and the standard speech frame group, the identity verification system inputs them into the pre-trained model, which converts the preset type acoustic features corresponding to the current speech frame group and those corresponding to the standard speech frame group into feature vectors of a preset length (for example, feature vectors of length 1). The cosine similarity of the two feature vectors is then calculated and compared with a preset threshold (for example, 0.95) to determine the identity verification result: if the cosine similarity reaches the threshold, the verification passes; otherwise, the verification fails.
  • The current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are thus first framed; the preset type acoustic features of each voice frame obtained by the framing are extracted using the preset filter; the extracted preset type acoustic features are input into the pre-trained preset structure deep neural network model, which converts the preset type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is calculated; and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short; the deep neural network model then processes the extracted acoustic features to output the verification result. The scheme therefore makes speaker identity verification more accurate and reliable.
  • FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
  • the identity verification system further includes:
  • The detecting module 105 is configured to perform active endpoint detection on the current voice data and the standard voice data before they are framed according to the preset framing parameters, and to delete the non-speaker speech in the current voice data and the standard voice data.
  • The non-speaker voice parts (for example, silence or noise) are removed by voice activity detection (VAD).
  • the training process of the preset structure deep neural network model is (refer to FIG. 2):
  • S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;
  • In this embodiment, each voice data sample is voice data of a known speaker identity; among the voice data samples, each speaker identity, or part of the speaker identities, corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the identity of the corresponding speaker.
  • The non-speaker voices (for example, silence or noise) in each voice data sample are deleted by active endpoint detection to obtain the standard voice data samples.
  • A first percentage of the obtained standard voice data samples is used as the training set and a second percentage as the verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%; for example, 70% of the standard voice data samples are used as the training set and 30% as the verification set.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter, and the preset type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstral Coefficient) features, for example 36-dimensional MFCC features.
  • The preset type acoustic features in the training set are processed in batches, divided into M (for example, 30) batches. The batches can be allocated by voice frame group, with each batch containing an equal or unequal number of voice frame groups and their corresponding preset type acoustic features. The preset type acoustic features corresponding to each speech frame group in the training set are input into the preset structure deep neural network model batch by batch for iterative training: each batch of preset type acoustic features causes the model to iterate once, and each iteration updates the model parameters. After multiple training iterations, the preset structure deep neural network model has been updated to better model parameters.
  • The accuracy of the preset structure deep neural network model is then verified using the verification set: the standard voice data samples in the verification set are paired into groups of two, and the samples of each group are input into the model in turn to obtain a verification result for that group.
  • The accuracy rate is calculated according to the number of correct verification results; for example, if 100 groups are verified and 99 groups yield correct verification results, the accuracy rate is 99%.
  • A verification threshold for the accuracy rate (that is, the preset threshold, for example 98.5%) is preset in the system for verifying the training effect of the preset structure deep neural network model.
  • If the verified accuracy of the preset structure deep neural network model is greater than the preset threshold, the training of the model has reached the standard and the model training ends.
  • Otherwise, the number of acquired voice data samples is increased, and steps S1-S5 are re-executed in a loop until the requirement of step S6 is met and the model training ends.
  • the process of the iterative training of the preset structure deep neural network model includes:
  • the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
  • the model parameters are updated according to the calculated cosine similarities and a predetermined loss function L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2 and N is the number of triplets obtained.
  • The model parameter updating step is: 1. use the back propagation algorithm to calculate the gradients of the preset structure deep neural network; 2. use mini-batch SGD (mini-batch stochastic gradient descent) to update the parameters of the preset structure deep neural network.
  • the network structure of the preset structure deep neural network model of this embodiment is as follows:
  • The first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
  • the second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;
  • the third layer is the deep neural network (DNN) fully connected layer;
  • the fourth layer is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • the fifth layer is the loss layer; the formula of the loss function L is L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big), where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
  • the present application further provides a computer readable storage medium storing an identity verification system executable by at least one processor to cause the at least one processor to execute The authentication method in any of the above embodiments.

Abstract

An electronic device, an identity verification method and a storage medium, wherein the method comprises: when current voice data of a target user whose identity is to be verified is received, acquiring standard voice data corresponding to the identity to be verified, and framing the two pieces of voice data to obtain a current voice frame group and a standard voice frame group (S10); using a preset filter to extract a preset type of acoustic feature from each voice frame of the two voice frame groups (S20); inputting the extracted acoustic features into a pre-trained deep neural network model of a preset structure to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data respectively (S30); and calculating the cosine similarity between the two feature vectors and determining the identity verification result according to the magnitude of the calculated cosine similarity (S40). The accuracy of speaker identity verification can thereby be improved.

Description

Electronic device, identity verification method and computer readable storage medium

Based on the Paris Convention, the present application claims priority to the Chinese patent application No. CN 2018102258872, entitled "Electronic Device, Identity Verification Method and Computer-Readable Storage Medium", filed on March 19, 2018, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the field of voiceprint recognition technologies, and in particular to an electronic device, an identity verification method, and a computer readable storage medium.

Background

Speaker recognition, commonly referred to as voiceprint recognition, is a type of biometric technology often used to confirm whether a given segment of speech was spoken by a designated person; it is a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example finance, securities, social security, public security, the military, and other civil security certification.

Speaker recognition includes text-dependent recognition and text-independent recognition. In recent years, text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past. However, in some constrained cases, for example when the collected effective speech of the speaker is short (less than 5 seconds), existing text-independent speaker recognition technology is not accurate and is error-prone.

Summary of the invention

The main purpose of the present application is to provide an electronic device, an identity verification method, and a computer readable storage medium, intended to improve the accuracy of speaker identity verification.
To achieve the above objective, the electronic device proposed by the present application includes a memory and a processor, where the memory stores an identity verification system executable on the processor, and the identity verification system, when executed by the processor, implements the following steps:

after receiving the current voice data of the target user whose identity is to be verified, obtaining the standard voice data corresponding to the identity to be verified from a database, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;

extracting, using a preset filter, the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;

inputting the preset type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data respectively;

calculating the cosine similarity of the two obtained feature vectors, and determining the identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.

Preferably, before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the processor is further configured to execute the identity verification system to implement the following step:

performing active endpoint detection on the current voice data and the standard voice data respectively, and deleting the non-speaker speech in the current voice data and the standard voice data.
Preferably, the training process of the preset structure deep neural network model is:

S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;

S2: perform active endpoint detection on each voice data sample, and delete the non-speaker speech in the voice data samples to obtain a preset number of standard voice data samples;

S3: use a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;

S4: frame each standard voice data sample in the training set and the verification set according to the preset framing parameters to obtain the voice frame group corresponding to each standard voice data sample, and then extract, using the preset filter, the preset type acoustic features of each voice frame in each voice frame group;

S5: divide the preset type acoustic features corresponding to the voice frame groups in the training set into M batches, input them batch by batch into the preset structure deep neural network model for iterative training, and, after the training of the preset structure deep neural network model is completed, verify the accuracy of the model using the verification set;

S6: if the verified accuracy is greater than a preset threshold, the model training ends;

S7: if the verified accuracy is less than or equal to the preset threshold, increase the number of acquired voice data samples, and re-execute the above steps S1-S5 based on the increased voice data samples.
Preferably, the iterative training process of the preset structure deep neural network model includes:

converting, according to the current parameters of the model, the preset type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of a preset length;

randomly selecting from the feature vectors to obtain a plurality of triplets, where the i-th triplet (x_{i1}, x_{i2}, x_{i3}) is composed of three different feature vectors x_{i1}, x_{i2} and x_{i3}, x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;

calculating, with a predetermined formula, the cosine similarity \cos(x_{i1}, x_{i2}) between x_{i1} and x_{i2} and the cosine similarity \cos(x_{i1}, x_{i3}) between x_{i1} and x_{i3};

updating the parameters of the model according to the cosine similarities and a predetermined loss function L, the formula of which is

L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big)

where \alpha is a constant ranging from 0.05 to 0.2, and N is the number of triplets obtained.
Preferably, the network structure of the preset structure deep neural network model is as follows:

The first layer consists of several stacked neural network layers of identical structure, where each layer uses a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of layers is 1 to 3;

The second layer is the average layer; this layer averages the vector sequences along the time axis, averaging the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, which are concatenated end to end into a single vector;

The third layer is the deep neural network (DNN) fully connected layer;

The fourth layer is the normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;

The fifth layer is the loss layer; the formula of the loss function L is

L = \sum_{i=1}^{N} \max\big(0, \cos(x_{i1}, x_{i3}) - \cos(x_{i1}, x_{i2}) + \alpha\big)

where \alpha is a constant ranging from 0.05 to 0.2, \cos(x_{i1}, x_{i2}) denotes the cosine similarity of two feature vectors belonging to the same speaker, and \cos(x_{i1}, x_{i3}) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
The present application also proposes an identity verification method, which includes:

after receiving the current voice data of the target user whose identity is to be verified, obtaining the standard voice data corresponding to the identity to be verified from a database, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;

extracting, using a preset filter, the preset type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;

inputting the preset type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data respectively;

calculating the cosine similarity of the two obtained feature vectors, and determining the identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.

Preferably, before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further includes the step of:

performing active endpoint detection on the current voice data and the standard voice data respectively, and deleting the non-speaker speech in the current voice data and the standard voice data.
Preferably, the training process of the preset structure deep neural network model is:

S1: acquire a preset number of voice data samples, and label each voice data sample with a label representing the corresponding speaker identity;

S2: perform active endpoint detection on each voice data sample, and delete the non-speaker speech in the voice data samples to obtain a preset number of standard voice data samples;

S3: use a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;

S4: frame each standard voice data sample in the training set and the verification set according to the preset framing parameters to obtain the voice frame group corresponding to each standard voice data sample, and then extract, using the preset filter, the preset type acoustic features of each voice frame in each voice frame group;

S5: divide the preset type acoustic features corresponding to the voice frame groups in the training set into M batches, input them batch by batch into the preset structure deep neural network model for iterative training, and, after the training of the preset structure deep neural network model is completed, verify the accuracy of the model using the verification set;

S6: if the verified accuracy is greater than a preset threshold, the model training ends;

S7: if the verified accuracy is less than or equal to the preset threshold, increase the number of acquired voice data samples, and re-execute the above steps S1-S5 based on the increased voice data samples.
Preferably, the network structure of the preset-structure deep neural network model is as follows:
First layer: several stacked neural network layers of identical structure, where each layer consists of a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of stacked layers is 1 to 3;
Second layer: an average layer, which averages vector sequences along the time axis; it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer, which normalizes the input of the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2, N is the number of sampled triplets, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
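For illustration only, the five layers described above might be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the hidden size, embedding size and two-layer depth are assumptions of the example, while the stacked bidirectional LSTM layers, the time-axis averaging, the fully connected layer and the L2 normalization follow the description above (the loss layer is treated separately with the training process).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerEmbeddingNet(nn.Module):
        # Sketch: stacked BiLSTM -> time-average -> DNN fully connected
        # layer -> L2 normalization to a length-1 feature vector.
        def __init__(self, feat_dim=36, hidden_dim=128, emb_dim=128, num_layers=2):
            super().__init__()
            # First layer: 1-3 stacked layers, each a forward LSTM and a
            # backward LSTM in parallel (i.e., a bidirectional LSTM).
            self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, bidirectional=True)
            # Third layer: DNN fully connected layer.
            self.fc = nn.Linear(2 * hidden_dim, emb_dim)

        def forward(self, x):          # x: (batch, frames, feat_dim)
            out, _ = self.lstm(x)      # (batch, frames, 2 * hidden_dim)
            # Second layer: averaging the concatenated forward/backward
            # outputs over time equals concatenating the forward average
            # vector and the backward average vector.
            avg = out.mean(dim=1)      # (batch, 2 * hidden_dim)
            # Fourth layer: L2 normalization.
            return F.normalize(self.fc(avg), p=2, dim=1)

A batch of per-frame features of shape (batch, frames, 36), such as the 36-dimensional MFCC features used as an example in the detailed description, then yields one unit-norm embedding per utterance.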
The present application further provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the following steps:
after the current voice data of the target user to be authenticated is received, obtaining from the database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
extracting, with a preset filter, preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
inputting the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively; computing the cosine similarity of the two resulting feature vectors; and determining an identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
In the technical solution of the present application, the current voice data received from the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; a preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing; the extracted preset-type acoustic features are then input into a pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is computed, and the verification result is determined according to its magnitude. By first framing the voice data into multiple voice frames and extracting the preset-type acoustic features from those frames, the solution can extract sufficiently many acoustic features even when the collected valid voice data is very short, and the trained deep neural network model then processes the extracted acoustic features to output the verification result. Compared with the prior art, this solution verifies speaker identity with higher accuracy and reliability.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of an embodiment of the identity verification method of the present application;
FIG. 2 is a schematic flowchart of the training process of the preset-structure deep neural network model of the present application;
FIG. 3 is a schematic diagram of the operating environment of an embodiment of the identity verification system of the present application;
FIG. 4 is a program module diagram of a first embodiment of the identity verification system of the present application;
FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
DETAILED DESCRIPTION
The principles and features of the present application are described below with reference to the accompanying drawings; the examples given are intended only to explain the present application, not to limit its scope.
FIG. 1 is a schematic flowchart of an embodiment of the identity verification method of the present application.
In this embodiment, the identity verification method includes:
Step S10: after the current voice data of the target user to be authenticated is received, obtaining from the database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data.
Standard voice data for each identity is pre-stored in the database of the identity verification system. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains from the database, according to the identity the target user claims (the identity to be verified), the corresponding standard voice data, and then frames the received current voice data and the obtained standard voice data according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising multiple voice frames obtained by framing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising multiple voice frames obtained by framing the standard voice data). The preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds.
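As a rough illustration of this framing step, the sketch below splits a waveform into overlapping 25 ms frames with a 10 ms shift; the 16 kHz sampling rate is an assumption of the example, while the frame length and shift are the example values above.

    import numpy as np

    def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
        # Split a 1-D waveform into overlapping frames (a "voice frame group").
        frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
        shift_len = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
        return np.stack([signal[i * shift_len : i * shift_len + frame_len]
                         for i in range(n_frames)])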
Step S20: extracting, with a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group.
After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the current voice frame group and in the standard voice frame group with the preset filter, to extract the preset-type acoustic features corresponding to each voice frame in the current voice frame group and to each voice frame in the standard voice frame group. For example, the preset filter is a Mel filter bank, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral features.
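One way such features might be extracted is sketched below with the librosa library; the use of librosa and the 16 kHz rate are assumptions of this example, while the 36-dimensional MFCC setting and the 25 ms/10 ms framing follow the text.

    import librosa

    def extract_mfcc(wav_path, sr=16000):
        # 36-dimensional MFCC features, one row per 25 ms frame, 10 ms shift.
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=36,
                                    n_fft=int(sr * 0.025),       # 25 ms window
                                    hop_length=int(sr * 0.010))  # 10 ms shift
        return mfcc.T  # shape: (n_frames, 36)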
Step S30: inputting the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively.
Step S40: computing the cosine similarity of the two resulting feature vectors, and determining the identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
The identity verification system contains a pre-trained preset-structure deep neural network model, a model iteratively trained on the corresponding preset-type acoustic features of sample voice data. After extracting features from the voice frames in the current voice frame group and the standard voice frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into this pre-trained model; the model converts each of them into a feature vector of a preset length (for example, a feature vector normalized to length 1), the cosine similarity of the two feature vectors is computed, and the identity verification result is determined according to its magnitude: the cosine similarity is compared with a preset threshold (for example, 0.95), and if the cosine similarity is greater than the preset threshold, the identity verification passes; otherwise, it fails. The cosine similarity is computed as cos(x_i, x_j) = x_i^T x_j, where x_i and x_j denote the two feature vectors and the superscript T denotes the transpose.
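The decision step itself reduces to a dot product of the two unit-norm vectors against the threshold; a minimal sketch, where the 0.95 threshold is the example value above:

    import numpy as np

    def verify(current_vec, standard_vec, threshold=0.95):
        # cos(x_i, x_j) = x_i^T x_j for L2-normalized embeddings.
        cos_sim = float(np.dot(current_vec, standard_vec))
        return "pass" if cos_sim > threshold else "fail"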
In the technical solution of this embodiment, the current voice data received from the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing; the extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is computed, and the verification result is determined according to its magnitude. By first framing the voice data into multiple voice frames and extracting the preset-type acoustic features from those frames, this solution can extract sufficiently many acoustic features even when the collected valid voice data is very short, and the trained deep neural network model then processes the extracted acoustic features to output the verification result. Compared with the prior art, this solution verifies speaker identity with higher accuracy and reliability.
Further, in this embodiment, before the step of framing the current voice data and the standard voice data respectively according to the preset framing parameters, the identity verification method further includes the step of:
performing voice activity detection on the current voice data and the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
Both the collected current voice data and the pre-stored standard voice data contain some non-speaker portions (for example, silence or noise). If these portions are not deleted, the voice frame groups obtained by framing the current voice data or the standard voice data will contain voice frames that include non-speaker portions (individual voice frames may even consist entirely of non-speaker audio), and the preset-type acoustic features that the preset filter extracts from such frames are impurity features, which would reduce the accuracy of the results produced by the preset-structure deep neural network model. Therefore, in this embodiment, before the voice data is framed, the non-speaker portions of the current voice data and the standard voice data are detected and deleted. The detection method used for the non-speaker portions in this embodiment is Voice Activity Detection (VAD).
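The text fixes VAD as the detection method but not a particular algorithm, so the following is only a crude energy-based stand-in to illustrate the idea; the frame size and relative threshold are assumptions of the example.

    import numpy as np

    def remove_nonspeech(signal, sample_rate=16000, frame_ms=20, rel_threshold=0.1):
        # Keep only frames whose short-time energy exceeds a fraction of the
        # mean energy; a real system would use a proper VAD algorithm here.
        n = int(sample_rate * frame_ms / 1000)
        frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
        energies = np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])
        keep = energies > rel_threshold * energies.mean()
        kept = [f for f, k in zip(frames, keep) if k]
        return np.concatenate(kept) if kept else signal[:0]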
As shown in FIG. 2, in this embodiment, the training process of the preset-structure deep neural network model is as follows:
S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
First, a preset number (for example, 10,000) of voice data samples is prepared, each sample being voice data of a known speaker identity. Among these samples, each speaker identity (or some of the speaker identities) corresponds to multiple voice data samples, and each voice data sample is labeled with a label representing the corresponding speaker identity.
S2: performing voice activity detection on each voice data sample and deleting the non-speaker speech from it, to obtain a preset number of standard voice data samples;
Voice activity detection is performed on the voice data samples to detect and delete the non-speaker audio (for example, silence or noise) in each sample, preventing the voice data samples from containing voice data unrelated to the voiceprint features of the corresponding speaker identity, which would impair the training of the model.
S3: taking a first percentage of the resulting standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
For example, 70% of the resulting standard voice data samples are taken as the training set and 30% as the validation set.
S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
Here, the preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter bank, and the preset-type acoustic features it extracts are MFCC (Mel Frequency Cepstrum Coefficient) spectral features, for example 36-dimensional MFCC spectral features.
S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding them batch by batch into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the preset-structure deep neural network model with the validation set;
The preset-type acoustic features in the training set are divided into M (for example, 30) batches, with the voice frame group as the allocation unit; each batch may be allocated the preset-type acoustic features of an equal or unequal number of voice frame groups. The preset-type acoustic features corresponding to the voice frame groups in the training set are fed into the preset-structure deep neural network model batch by batch for iterative training: each batch drives one iteration of the model, each iteration updates the model parameters, and after many iterations the model has been updated to better parameters. After the iterative training is completed, the accuracy of the model is verified with the validation set: the standard voice data in the validation set are grouped into pairs, the preset-type acoustic features of the standard voice data samples of one pair are input to the model at a time, and the output verification result is checked for correctness against the identity labels of the two standard voice data samples. After all pairs have been verified, the accuracy is computed from the number of correct results; for example, if 100 pairs are verified and 99 of them yield correct results, the accuracy is 99%.
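A sketch of this pairwise check; the data layout and the `model` callable are assumptions of the example, and 0.95 reuses the earlier example threshold.

    def validation_accuracy(model, pairs, threshold=0.95):
        # pairs: list of ((features_a, label_a), (features_b, label_b)).
        correct = 0
        for (feats_a, label_a), (feats_b, label_b) in pairs:
            emb_a, emb_b = model(feats_a), model(feats_b)
            same_predicted = float(emb_a @ emb_b) > threshold
            correct += int(same_predicted == (label_a == label_b))
        return correct / len(pairs)  # e.g. 99 correct of 100 pairs -> 99%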
S6: if the verified accuracy is greater than a preset threshold, the model training ends;
A verification threshold for the accuracy (i.e., the preset threshold, for example 98.5%) is preset in the system to check the training effect of the preset-structure deep neural network model. If the accuracy obtained by verifying the model with the validation set is greater than the preset threshold, the training of the model has reached the standard, and the model training ends.
S7: if the verified accuracy is less than or equal to the preset threshold, the number of acquired voice data samples is increased, and the above steps S1-S5 are re-executed on the enlarged sample set.
If the accuracy obtained by verifying the preset-structure deep neural network model with the validation set is less than or equal to the preset threshold, the training of the model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In this case, the number of acquired voice data samples is increased (for example, by a fixed amount or by a random amount each time), and on this basis the above steps S1-S5 are re-executed, looping until the requirement of step S6 is met, at which point the model training ends.
In this embodiment, the iterative training process of the preset-structure deep neural network model includes:
converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of a preset length according to the current parameters of the model;
randomly selecting from the resulting feature vectors to obtain multiple triplets, the i-th triplet (x_i1, x_i2, x_i3) consisting of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
computing, according to the predetermined calculation formula, the cosine similarity cos(x_i1, x_i2) = x_i1^T x_i2 between x_i1 and x_i2, and the cosine similarity cos(x_i1, x_i3) = x_i1^T x_i3 between x_i1 and x_i3;
updating the parameters of the model according to the cosine similarities cos(x_i1, x_i2) and cos(x_i1, x_i3) and a predetermined loss function L, the formula of the predetermined loss function L being:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2 and N is the number of triplets obtained.
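Assuming the embeddings are already L2-normalized (which the fourth, normalization layer guarantees), this loss might be sketched in PyTorch as follows; the default margin value plays the role of α.

    import torch

    def triplet_cosine_loss(anchor, positive, negative, alpha=0.1):
        # L = sum_i max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + alpha), where
        # (anchor, positive) share a speaker and (anchor, negative) do not.
        cos_pos = (anchor * positive).sum(dim=1)  # cos(x_i1, x_i2)
        cos_neg = (anchor * negative).sum(dim=1)  # cos(x_i1, x_i3)
        return torch.clamp(cos_neg - cos_pos + alpha, min=0).sum()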
The model parameters are updated in two steps: 1. the gradients of the preset-structure deep neural network are computed with the backpropagation algorithm; 2. the parameters of the preset-structure deep neural network are updated with the mini-batch SGD (mini-batch stochastic gradient descent) method.
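Putting the pieces together, one mini-batch update might be sketched as follows; the learning rate is an assumption of the example, and `SpeakerEmbeddingNet` and `triplet_cosine_loss` refer to the illustrative sketches above.

    import torch

    model = SpeakerEmbeddingNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # mini-batch SGD

    def train_step(anchor_feats, positive_feats, negative_feats):
        # One iteration: forward pass, triplet loss, backpropagation, SGD update.
        optimizer.zero_grad()
        loss = triplet_cosine_loss(model(anchor_feats),
                                   model(positive_feats),
                                   model(negative_feats))
        loss.backward()   # backpropagation computes the gradients
        optimizer.step()  # mini-batch SGD updates the parameters
        return loss.item()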
Further, the network structure of the preset-structure deep neural network model of this embodiment is as follows:
First layer: several stacked neural network layers of identical structure, where each layer consists of a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of stacked layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
Second layer: an average layer, which averages vector sequences along the time axis; it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer, which normalizes the input of the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2, N is the number of sampled triplets, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
In addition, the present application further provides an identity verification system.
FIG. 3 is a schematic diagram of the operating environment of a preferred embodiment of the identity verification system 10 of the present application.
In this embodiment, the identity verification system 10 is installed and runs in the electronic device 1. The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a server. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a display 13. FIG. 3 shows only the electronic device 1 with the components 11-13, but it should be understood that not all of the illustrated components need be implemented, and more or fewer components may be implemented instead.
In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a hard disk or internal memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 is used to store the application software installed in the electronic device 1 and various types of data, such as the program code of the identity verification system 10, and may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a microprocessor or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the identity verification system 10.
In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 13 is used to display the information processed in the electronic device 1 and to display a visualized user interface, for example a service customization interface. The components 11-13 of the electronic device 1 communicate with one another through a system bus.
FIG. 4 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application. In this embodiment, the identity verification system 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application. For example, in FIG. 4, the identity verification system 10 may be divided into a framing module 101, an extraction module 102, a calculation module 103 and a result determination module 104. A module referred to in this application is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing the execution of the identity verification system 10 in the electronic device 1, wherein:
the framing module 101 is configured to, after the current voice data of the target user to be authenticated is received, obtain from the database the standard voice data corresponding to the identity to be verified, and frame the current voice data and the standard voice data respectively according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data and the standard voice frame group corresponding to the standard voice data.
Standard voice data for each identity is pre-stored in the database of the identity verification system. After receiving the current voice data of the target user to be authenticated, the identity verification system obtains from the database, according to the identity the target user claims (the identity to be verified), the corresponding standard voice data, and then frames the received current voice data and the obtained standard voice data according to the preset framing parameters, to obtain the current voice frame group corresponding to the current voice data (comprising multiple voice frames obtained by framing the current voice data) and the standard voice frame group corresponding to the standard voice data (comprising multiple voice frames obtained by framing the standard voice data). The preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds.
The extraction module 102 is configured to extract, with the preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group.
After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the current voice frame group and in the standard voice frame group with the preset filter, to extract the preset-type acoustic features corresponding to each voice frame in the current voice frame group and to each voice frame in the standard voice frame group. For example, the preset filter is a Mel filter bank, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral features.
The calculation module 103 is configured to input the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively.
The result determination module 104 is configured to compute the cosine similarity of the two resulting feature vectors and determine the identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
The identity verification system contains a pre-trained preset-structure deep neural network model, a model iteratively trained on the corresponding preset-type acoustic features of sample voice data. After extracting features from the voice frames in the current voice frame group and the standard voice frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into this pre-trained model; the model converts each of them into a feature vector of a preset length (for example, a feature vector normalized to length 1), the cosine similarity of the two feature vectors is computed, and the identity verification result is determined according to its magnitude: the cosine similarity is compared with a preset threshold (for example, 0.95), and if the cosine similarity is greater than the preset threshold, the identity verification passes; otherwise, it fails. The cosine similarity is computed as cos(x_i, x_j) = x_i^T x_j, where x_i and x_j denote the two feature vectors and the superscript T denotes the transpose.
In the technical solution of this embodiment, the current voice data received from the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing; the extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors; the cosine similarity of the two feature vectors is computed, and the verification result is determined according to its magnitude. By first framing the voice data into multiple voice frames and extracting the preset-type acoustic features from those frames, this solution can extract sufficiently many acoustic features even when the collected valid voice data is very short, and the trained deep neural network model then processes the extracted acoustic features to output the verification result. Compared with the prior art, this solution verifies speaker identity with higher accuracy and reliability.
FIG. 5 is a program module diagram of a second embodiment of the identity verification system of the present application.
In this embodiment, the identity verification system further includes:
the detection module 105, configured to, before the current voice data and the standard voice data are framed respectively according to the preset framing parameters, perform voice activity detection on the current voice data and the standard voice data respectively, and delete the non-speaker speech from the current voice data and the standard voice data.
Both the collected current voice data and the pre-stored standard voice data contain some non-speaker portions (for example, silence or noise). If these portions are not deleted, the voice frame groups obtained by framing the current voice data or the standard voice data will contain voice frames that include non-speaker portions (individual voice frames may even consist entirely of non-speaker audio), and the preset-type acoustic features that the preset filter extracts from such frames are impurity features, which would reduce the accuracy of the results produced by the preset-structure deep neural network model. Therefore, in this embodiment, before the voice data is framed, the non-speaker portions of the current voice data and the standard voice data are detected and deleted. The detection method used for the non-speaker portions in this embodiment is Voice Activity Detection (VAD).
In this embodiment, the training process of the preset-structure deep neural network model is as follows (see FIG. 2):
S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
First, a preset number (for example, 10,000) of voice data samples is prepared, each sample being voice data of a known speaker identity. Among these samples, each speaker identity (or some of the speaker identities) corresponds to multiple voice data samples, and each voice data sample is labeled with a label representing the corresponding speaker identity.
S2: performing voice activity detection on each voice data sample and deleting the non-speaker speech from it, to obtain a preset number of standard voice data samples;
Voice activity detection is performed on the voice data samples to detect and delete the non-speaker audio (for example, silence or noise) in each sample, preventing the voice data samples from containing voice data unrelated to the voiceprint features of the corresponding speaker identity, which would impair the training of the model.
S3: taking a first percentage of the resulting standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
For example, 70% of the resulting standard voice data samples are taken as the training set and 30% as the validation set.
S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
Here, the preset framing parameters are, for example, a frame every 25 milliseconds with a frame shift of 10 milliseconds; the preset filter is, for example, a Mel filter bank, and the preset-type acoustic features it extracts are MFCC (Mel Frequency Cepstrum Coefficient) spectral features, for example 36-dimensional MFCC spectral features.
S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding them batch by batch into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the preset-structure deep neural network model with the validation set;
The preset-type acoustic features in the training set are divided into M (for example, 30) batches, with the voice frame group as the allocation unit; each batch may be allocated the preset-type acoustic features of an equal or unequal number of voice frame groups. The preset-type acoustic features corresponding to the voice frame groups in the training set are fed into the preset-structure deep neural network model batch by batch for iterative training: each batch drives one iteration of the model, each iteration updates the model parameters, and after many iterations the model has been updated to better parameters. After the iterative training is completed, the accuracy of the model is verified with the validation set: the standard voice data in the validation set are grouped into pairs, the preset-type acoustic features of the standard voice data samples of one pair are input to the model at a time, and the output verification result is checked for correctness against the identity labels of the two standard voice data samples. After all pairs have been verified, the accuracy is computed from the number of correct results; for example, if 100 pairs are verified and 99 of them yield correct results, the accuracy is 99%.
S6: if the verified accuracy is greater than a preset threshold, the model training ends;
A verification threshold for the accuracy (i.e., the preset threshold, for example 98.5%) is preset in the system to check the training effect of the preset-structure deep neural network model. If the accuracy obtained by verifying the model with the validation set is greater than the preset threshold, the training of the model has reached the standard, and the model training ends.
S7: if the verified accuracy is less than or equal to the preset threshold, the number of acquired voice data samples is increased, and the above steps S1-S5 are re-executed on the enlarged sample set.
If the accuracy obtained by verifying the preset-structure deep neural network model with the validation set is less than or equal to the preset threshold, the training of the model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In this case, the number of acquired voice data samples is increased (for example, by a fixed amount or by a random amount each time), and on this basis the above steps S1-S5 are re-executed, looping until the requirement of step S6 is met, at which point the model training ends.
In this embodiment, the iterative training process of the preset-structure deep neural network model includes:
converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of a preset length according to the current parameters of the model;
randomly selecting from the resulting feature vectors to obtain multiple triplets, the i-th triplet (x_i1, x_i2, x_i3) consisting of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
computing, according to the predetermined calculation formula, the cosine similarity cos(x_i1, x_i2) = x_i1^T x_i2 between x_i1 and x_i2, and the cosine similarity cos(x_i1, x_i3) = x_i1^T x_i3 between x_i1 and x_i3;
updating the parameters of the model according to the cosine similarities cos(x_i1, x_i2) and cos(x_i1, x_i3) and a predetermined loss function L, the formula of the predetermined loss function L being:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2 and N is the number of triplets obtained.
The model parameters are updated in two steps: 1. the gradients of the preset-structure deep neural network are computed with the backpropagation algorithm; 2. the parameters of the preset-structure deep neural network are updated with the mini-batch SGD (mini-batch stochastic gradient descent) method.
Further, the network structure of the preset-structure deep neural network model of this embodiment is as follows:
First layer: several stacked neural network layers of identical structure, where each layer consists of a forward long short-term memory network (LSTM) and a backward LSTM in parallel, and the number of stacked layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
Second layer: an average layer, which averages vector sequences along the time axis; it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer, which normalizes the input of the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is:
L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α)
where α is a constant in the range 0.05 to 0.2, N is the number of sampled triplets, cos(x_i1, x_i2) denotes the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) denotes the cosine similarity of two feature vectors not belonging to the same speaker.
Further, the present application also provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the identity verification method of any of the above embodiments.
The above description is only of preferred embodiments of the present application and does not thereby limit the patent scope of the present application; any equivalent structural transformation made using the contents of the specification and drawings of the present application under its inventive concept, and any direct or indirect application in other related technical fields, are likewise included within the patent protection scope of the present application.

Claims (20)

1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing an identity verification system operable on the processor, the identity verification system, when executed by the processor, implementing the following steps:
    after the current voice data of the target user to be authenticated is received, obtaining from a database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
    extracting, with a preset filter, preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
    inputting the preset-type acoustic features corresponding to the extracted current voice frame group and those corresponding to the standard voice frame group into a pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and to the standard voice data respectively;
    computing the cosine similarity of the two resulting feature vectors, and determining an identity verification result according to the magnitude of the computed cosine similarity, the identity verification result including a verification-pass result and a verification-failure result.
  2. The electronic device of claim 1, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
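A minimal PyTorch-style sketch of this five-layer structure up to the normalization layer (the loss layer is applied only during training, as sketched earlier); the feature dimension, hidden size, embedding length, and layer count here are illustrative assumptions within the claimed structure:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerEmbeddingNet(nn.Module):
        """Stacked bidirectional LSTM -> temporal average layer ->
        DNN fully connected layer -> L2 normalization."""
        def __init__(self, feat_dim=40, hidden=128, emb_dim=256, num_layers=2):
            super().__init__()
            # First layer: 1-3 stacked layers, each a forward + backward LSTM.
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                                batch_first=True, bidirectional=True)
            # Third layer: DNN fully connected layer.
            self.fc = nn.Linear(2 * hidden, emb_dim)

        def forward(self, frames):            # frames: (batch, time, feat_dim)
            out, _ = self.lstm(frames)        # (batch, time, 2*hidden)
            half = out.size(-1) // 2
            fwd, bwd = out[..., :half], out[..., half:]
            # Second layer: average each direction along the time axis,
            # then concatenate the forward and backward average vectors.
            avg = torch.cat([fwd.mean(dim=1), bwd.mean(dim=1)], dim=-1)
            emb = self.fc(avg)
            # Fourth layer: L2-normalize to a feature vector of length 1.
            return F.normalize(emb, p=2, dim=-1)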
  3. The electronic device of claim 1, wherein before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the processor further executes the identity verification system to implement the following step:
    performing active endpoint detection on the current voice data and on the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
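A minimal sketch of such active endpoint detection with the webrtcvad package, keeping only the frames classified as speech; the 30 ms frame length and aggressiveness mode 2 are illustrative assumptions:

    import webrtcvad

    def drop_non_speech(pcm16: bytes, sample_rate=16000, frame_ms=30, mode=2):
        """Delete non-speaker audio: keep only frames the detector marks
        as speech. Expects 16-bit mono PCM at 8/16/32/48 kHz."""
        vad = webrtcvad.Vad(mode)            # 0 (lenient) .. 3 (aggressive)
        frame_bytes = sample_rate * frame_ms // 1000 * 2
        kept = []
        for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
            frame = pcm16[i:i + frame_bytes]
            if vad.is_speech(frame, sample_rate):
                kept.append(frame)
        return b"".join(kept)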
  4. The electronic device of claim 3, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  5. The electronic device of claim 1, wherein the training process of the preset-structure deep neural network model is:
    S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
    S2: performing active endpoint detection on each voice data sample and deleting the non-speaker speech from the voice data samples, to obtain a preset number of standard voice data samples;
    S3: taking a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
    S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
    S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding the batches into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the model with the validation set;
    S6: if the verified accuracy is greater than a preset threshold, ending the model training;
    S7: if the verified accuracy is less than or equal to the preset threshold, increasing the number of acquired voice data samples and re-executing steps S1-S5 based on the increased voice data samples.
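A control-flow sketch of steps S1-S7 in Python; here train_one_pass, evaluate, and get_more_samples are assumed helpers standing in for the batch training, validation-accuracy check, and sample acquisition, and the 0.95 threshold, 80/20 split, and 32 batches are illustrative values only:

    import random

    def chunk(items, m):
        """Split items into m roughly equal batches (step S5)."""
        size = max(1, -(-len(items) // m))        # ceiling division
        return [items[i:i + size] for i in range(0, len(items), size)]

    def train_until_accurate(samples, labels, model, acc_threshold=0.95,
                             train_frac=0.8, m_batches=32):
        """Split labeled standard samples into training and validation sets,
        train in M batches, validate, and grow the sample pool until the
        validation accuracy exceeds the preset threshold (steps S3-S7)."""
        while True:
            data = list(zip(samples, labels))
            random.shuffle(data)
            n_train = int(train_frac * len(data))
            train_set, val_set = data[:n_train], data[n_train:]   # S3 split
            for batch in chunk(train_set, m_batches):             # S5 batches
                train_one_pass(model, batch)                      # assumed helper
            if evaluate(model, val_set) > acc_threshold:          # S6: done
                return model
            samples, labels = get_more_samples(samples, labels)   # S7: add data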
  6. The electronic device of claim 5, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  7. The electronic device of claim 5, wherein the process of iteratively training the preset-structure deep neural network model comprises:
    converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of the preset length according to the current parameters of the model;
    randomly selecting from the resulting feature vectors to obtain a plurality of triplets, the i-th triplet (x_{i1}, x_{i2}, x_{i3}) consisting of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
    calculating, with a predetermined calculation formula, the cosine similarity cos(x_{i1}, x_{i2}) between x_{i1} and x_{i2}, and the cosine similarity cos(x_{i1}, x_{i3}) between x_{i1} and x_{i3};
    updating the parameters of the model according to the cosine similarities cos(x_{i1}, x_{i2}) and cos(x_{i1}, x_{i3}) and a predetermined loss function L, the formula of the predetermined loss function L being
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, and N is the number of triplets obtained.
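A minimal Python sketch of the random triplet selection described above, grouping feature vectors by speaker label and drawing (x_{i1}, x_{i2}, x_{i3}) so that the first two share a speaker and the third does not; all names are illustrative, and at least two speakers with two or more vectors each are assumed:

    import random
    from collections import defaultdict

    def sample_triplets(vectors, speaker_ids, n_triplets):
        """Randomly draw (x_i1, x_i2, x_i3) triplets: x_i1 and x_i2 come
        from the same speaker, x_i3 from a different speaker."""
        by_speaker = defaultdict(list)
        for vec, spk in zip(vectors, speaker_ids):
            by_speaker[spk].append(vec)
        eligible = [s for s, vs in by_speaker.items() if len(vs) >= 2]
        triplets = []
        for _ in range(n_triplets):
            spk = random.choice(eligible)
            x1, x2 = random.sample(by_speaker[spk], 2)       # same speaker
            other = random.choice([s for s in by_speaker if s != spk])
            x3 = random.choice(by_speaker[other])            # different speaker
            triplets.append((x1, x2, x3))
        return triplets

The sampled triplets feed directly into the triplet cosine loss sketched earlier, whose gradient is then used to update the model parameters.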
  8. The electronic device of claim 7, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  9. An identity verification method, comprising:
    after receiving the current voice data of a target user to be authenticated, acquiring from a database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
    extracting, with a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
    inputting the extracted preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into a pre-trained deep neural network model of a preset structure, to obtain feature vectors of a preset length corresponding respectively to the current voice data and the standard voice data;
    calculating the cosine similarity of the two obtained feature vectors, and determining an identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification pass result and a verification failure result.
  10. The identity verification method of claim 9, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  11. The identity verification method of claim 9, wherein before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further comprises the step of:
    performing active endpoint detection on the current voice data and on the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
  12. The identity verification method of claim 11, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  13. The identity verification method of claim 9, wherein the training process of the preset-structure deep neural network model is:
    S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
    S2: performing active endpoint detection on each voice data sample and deleting the non-speaker speech from the voice data samples, to obtain a preset number of standard voice data samples;
    S3: taking a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
    S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
    S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding the batches into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the model with the validation set;
    S6: if the verified accuracy is greater than a preset threshold, ending the model training;
    S7: if the verified accuracy is less than or equal to the preset threshold, increasing the number of acquired voice data samples and re-executing steps S1-S5 based on the increased voice data samples.
  14. The identity verification method of claim 13, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  15. The identity verification method of claim 13, wherein the process of iteratively training the preset-structure deep neural network model comprises:
    converting the preset-type acoustic features corresponding to each currently input voice frame group into a corresponding feature vector of the preset length according to the current parameters of the model;
    randomly selecting from the resulting feature vectors to obtain a plurality of triplets, the i-th triplet (x_{i1}, x_{i2}, x_{i3}) consisting of three different feature vectors x_{i1}, x_{i2} and x_{i3}, where x_{i1} and x_{i2} correspond to the same speaker, x_{i1} and x_{i3} correspond to different speakers, and i is a positive integer;
    calculating, with a predetermined calculation formula, the cosine similarity cos(x_{i1}, x_{i2}) between x_{i1} and x_{i2}, and the cosine similarity cos(x_{i1}, x_{i3}) between x_{i1} and x_{i3};
    updating the parameters of the model according to the cosine similarities cos(x_{i1}, x_{i2}) and cos(x_{i1}, x_{i3}) and a predetermined loss function L, the formula of the predetermined loss function L being
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, and N is the number of triplets obtained.
  16. The identity verification method of claim 15, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
  17. A computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the following steps:
    after receiving the current voice data of a target user to be authenticated, acquiring from a database the standard voice data corresponding to the identity to be verified, and framing the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
    extracting, with a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and of each voice frame in the standard voice frame group;
    inputting the extracted preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into a pre-trained deep neural network model of a preset structure, to obtain feature vectors of a preset length corresponding respectively to the current voice data and the standard voice data;
    calculating the cosine similarity of the two obtained feature vectors, and determining an identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification pass result and a verification failure result.
  18. The computer-readable storage medium of claim 17, wherein before the step of framing the current voice data and the standard voice data according to the preset framing parameters, the identity verification method further comprises the step of:
    performing active endpoint detection on the current voice data and on the standard voice data respectively, and deleting the non-speaker speech from the current voice data and the standard voice data.
  19. The computer-readable storage medium of claim 17, wherein the training process of the preset-structure deep neural network model is:
    S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
    S2: performing active endpoint detection on each voice data sample and deleting the non-speaker speech from the voice data samples, to obtain a preset number of standard voice data samples;
    S3: taking a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
    S4: framing each standard voice data sample in the training set and the validation set according to the preset framing parameters, to obtain a voice frame group corresponding to each standard voice data sample, and then extracting, with the preset filter, the preset-type acoustic features of each voice frame in each voice frame group;
    S5: dividing the preset-type acoustic features corresponding to the voice frame groups in the training set into M batches, feeding the batches into the preset-structure deep neural network model for iterative training, and, after the training of the model is completed, verifying the accuracy of the model with the validation set;
    S6: if the verified accuracy is greater than a preset threshold, ending the model training;
    S7: if the verified accuracy is less than or equal to the preset threshold, increasing the number of acquired voice data samples and re-executing steps S1-S5 based on the increased voice data samples.
  20. The computer-readable storage medium of claim 17, wherein the network structure of the preset-structure deep neural network model is as follows:
    the first layer is a stack of several neural network layers of identical structure, each layer consisting of one forward long short-term memory (LSTM) network and one backward LSTM in parallel, the number of stacked layers being 1 to 3;
    the second layer is an average layer, whose function is to average vector sequences along the time axis: it averages the vector sequences output by the forward LSTM and the backward LSTM of the previous layer to obtain a forward average vector and a backward average vector, and concatenates the two average vectors end to end into a single vector;
    the third layer is a deep neural network (DNN) fully connected layer;
    the fourth layer is a normalization layer, which normalizes the input from the previous layer by the L2 norm to obtain a normalized feature vector of length 1;
    the fifth layer is the loss layer, with the loss function L given by
    L = Σ_{i=1}^{N} max(0, cos(x_{i1}, x_{i3}) − cos(x_{i1}, x_{i2}) + α)
    where α is a constant with a value ranging from 0.05 to 0.2, cos(x_{i1}, x_{i2}) represents the cosine similarity of two feature vectors belonging to the same speaker, cos(x_{i1}, x_{i3}) represents the cosine similarity of two feature vectors that do not belong to the same speaker, and N is the number of feature-vector triplets.
PCT/CN2018/102105 2018-03-19 2018-08-24 Electronic device, identity verification method and computer-readable storage medium WO2019179029A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810225887.2A CN108564955B (en) 2018-03-19 2018-03-19 Electronic device, auth method and computer readable storage medium
CN201810225887.2 2018-03-19

Publications (1)

Publication Number Publication Date
WO2019179029A1 true WO2019179029A1 (en) 2019-09-26

Family

ID=63532742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102105 WO2019179029A1 (en) 2018-03-19 2018-08-24 Electronic device, identity verification method and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN108564955B (en)
WO (1) WO2019179029A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712792A (en) * 2019-10-25 2021-04-27 Tcl集团股份有限公司 Dialect recognition model training method, readable storage medium and terminal device

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564954B (en) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity verification method, and storage medium
CN110289003B (en) * 2018-10-10 2021-10-29 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN109346086A (en) * 2018-10-26 2019-02-15 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and computer readable storage medium
US10887317B2 (en) * 2018-11-28 2021-01-05 Sap Se Progressive authentication security adapter
CN110148402A (en) * 2019-05-07 2019-08-20 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN110570871A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, device and equipment
CN111933153B (en) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 Voice segmentation point determining method and device
CN112016673A (en) * 2020-07-24 2020-12-01 浙江工业大学 Mobile equipment user authentication method and device based on optimized LSTM
CN112309365A (en) * 2020-10-21 2021-02-02 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112347788A (en) * 2020-11-06 2021-02-09 平安消费金融有限公司 Corpus processing method, apparatus and storage medium
CN113178197B (en) * 2021-04-27 2024-01-09 平安科技(深圳)有限公司 Training method and device of voice verification model and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores
CN105139857A (en) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countercheck method for automatically identifying speaker aiming to voice deception
CN106205624A (en) * 2016-07-15 2016-12-07 河海大学 A kind of method for recognizing sound-groove based on DBSCAN algorithm
CN106782564A (en) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech data
CN107610707A (en) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device

Also Published As

Publication number Publication date
CN108564955B (en) 2019-09-03
CN108564955A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
WO2019179029A1 (en) Electronic device, identity verification method and computer-readable storage medium
WO2019179036A1 (en) Deep neural network model, electronic device, identity authentication method, and storage medium
JP6429945B2 (en) Method and apparatus for processing audio data
US6490560B1 (en) Method and system for non-intrusive speaker verification using behavior models
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
US7689418B2 (en) Method and system for non-intrusive speaker verification using behavior models
US10650379B2 (en) Method and system for validating personalized account identifiers using biometric authentication and self-learning algorithms
US11482050B2 (en) Intelligent gallery management for biometrics
CN108989349B (en) User account unlocking method and device, computer equipment and storage medium
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
US11062120B2 (en) High speed reference point independent database filtering for fingerprint identification
KR20180082948A (en) Method and apparatus for authenticating a user using an electrocardiogram signal
US20230012235A1 (en) Using an enrolled biometric dataset to detect adversarial examples in biometrics-based authentication system
EP1470549B1 (en) Method and system for non-intrusive speaker verification using behavior models
WO2023134232A1 (en) Method, apparatus and device for updating feature vector database, and medium
CN116561737A (en) Password validity detection method based on user behavior base line and related equipment thereof
CN113035230A (en) Authentication model training method and device and electronic equipment
WO2023078115A1 (en) Information verification method, and server and storage medium
CN117373082A (en) Face recognition method, device, equipment and storage medium
CN111261155A (en) Speech processing method, computer-readable storage medium, computer program, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18910797

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18910797

Country of ref document: EP

Kind code of ref document: A1