WO2019179036A1 - Deep neural network model, electronic device, identity authentication method and storage medium - Google Patents

Deep neural network model, electronic device, identity authentication method and storage medium

Info

Publication number
WO2019179036A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
preset
standard
neural network
current
Prior art date
Application number
PCT/CN2018/102218
Other languages
English (en)
Chinese (zh)
Inventor
赵峰
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019179036A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The present application relates to the field of voiceprint recognition technology, and in particular to a deep neural network model, an electronic device, an identity authentication method, and a storage medium.
  • Speaker recognition, commonly referred to as voiceprint recognition, is a biometric technology often used to confirm whether a given segment of speech was spoken by a designated person; it is a "one-to-one discrimination" problem. Speaker recognition is widely used in many fields, for example in finance, securities, social security, public security, military, and other civil security authentication applications.
  • Speaker recognition includes text-dependent recognition and text-independent recognition.
  • In recent years, text-independent speaker recognition technology has made continuous breakthroughs, and its accuracy has improved greatly compared with the past.
  • However, existing text-independent speaker recognition technology is still insufficiently accurate and error-prone.
  • The main purpose of the present application is to provide a deep neural network model, an electronic device, an identity authentication method, and a storage medium, which are intended to improve the accuracy of speaker identity authentication.
  • The first aspect of the present application provides a deep neural network model, including:
  • The first layer structure consists of multiple stacked neural network layers having the same preset structure. Each such layer includes two concatenated CNN convolutional layers, two rectified linear units (ReLU), and a direct connection operation X that directly connects the two concatenated CNN convolutional layers, wherein each ReLU corresponds one-to-one with a CNN convolutional layer and is connected in series after its corresponding CNN convolutional layer. The direct connection operation X adds the input of the convolution operation of the first of the two concatenated CNN convolutional layers to the output of the convolution operation of the second CNN convolutional layer, and sends the result to the ReLU corresponding to the second CNN convolutional layer;
  • The second layer structure is an averaging layer, which averages the two-dimensional vector sequence output by the first layer structure along the time axis;
  • The third layer structure is a DNN fully connected layer;
  • The fourth layer structure is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • The fifth layer structure is the loss layer, whose loss function L is L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant with a value ranging from 0.05 to 0.2, cos(x_i1, x_i2) represents the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) represents the cosine similarity of two feature vectors that do not belong to the same speaker.
  • A second aspect of the present application provides an electronic device including a memory and a processor, the memory storing an identity verification system executable on the processor, the identity verification system implementing the following steps when executed:
  • After the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from the database, and framing processing is performed on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The preset-structure deep neural network model is the deep neural network model provided by the first aspect of the present application.
  • A third aspect of the present application provides an identity verification method, where the identity verification method includes:
  • After the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from the database, and framing processing is performed on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The preset-structure deep neural network model is the deep neural network model provided by the first aspect of the present application.
  • A fourth aspect of the present application provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the following steps:
  • After the current voice data of the target user whose identity is to be verified is received, the standard voice data corresponding to the identity to be verified is obtained from the database, and framing processing is performed on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The preset-structure deep neural network model is the deep neural network model provided by the first aspect of the present application.
  • In the present application, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first subjected to framing processing, and a preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing processing. The extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and the preset-type acoustic features corresponding to the standard voice data into corresponding feature vectors. The cosine similarity of the two feature vectors is calculated, and the verification result is determined according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset-type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short. The deep neural network model of the present application is then used to process the extracted acoustic features, which significantly enhances the model's ability to extract features from the input data, reduces the risk of performance degradation as the network depth increases, and improves the accuracy of the output verification results.
  • FIG. 1 is a schematic structural diagram of a neural network layer of a preset structure of a first layer structure in a preferred embodiment of the deep neural network model of the present application;
  • FIG. 2 is a schematic flow chart of a deep neural network model training process according to the present application.
  • FIG. 3 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
  • FIG. 4 is a schematic diagram of an operating environment of an embodiment of an identity verification system according to the present application.
  • FIG. 5 is a block diagram of a program of an embodiment of an identity verification system according to the present application.
  • FIG. 6 is a block diagram of the program of a second embodiment of an identity verification system according to the present application.
  • The present application proposes a deep neural network model for speaker identity verification.
  • The first layer structure consists of multiple stacked (for example, 9 to 12) neural network layers having the same preset structure.
  • The neural network layer of each preset structure includes: two concatenated CNN convolutional layers 100 (for example, each CNN convolutional layer (conv) 100 may adopt a 3*3 convolution kernel, a step size of 1*1, and 64 channels), two rectified linear units (ReLU) 200, and a direct connection operation X that directly connects the two concatenated CNN convolutional layers 100, wherein each ReLU 200 corresponds one-to-one with a CNN convolutional layer 100 and is connected in series after its corresponding CNN convolutional layer 100. The direct connection operation X adds the input of the convolution operation of the first of the two concatenated CNN convolutional layers 100 to the output of the convolution operation of the second CNN convolutional layer 100, and sends the result to the ReLU 200 corresponding to the second CNN convolutional layer 100.
  • The second layer structure is an averaging layer, which averages the two-dimensional vector sequence output by the first layer structure along the time axis;
  • The third layer structure is a DNN fully connected layer;
  • The fourth layer structure is a normalization layer, which normalizes the input of the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
  • The fifth layer structure is the loss layer, whose loss function L is L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α), where α is a constant with a value ranging from 0.05 to 0.2, cos(x_i1, x_i2) represents the cosine similarity of two feature vectors belonging to the same speaker, and cos(x_i1, x_i3) represents the cosine similarity of two feature vectors that do not belong to the same speaker.
  • The model can significantly enhance feature extraction from the input data and reduce the risk of performance degradation as the network depth increases. A concrete code sketch of this structure is given below.
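  • The following is a minimal sketch of layers one through four of the structure described above, assuming PyTorch; the channel counts, embedding size, input layout (batch, channel, MFCC dimension, time), and all class and variable names are illustrative assumptions rather than details from the patent.

```python
# Minimal sketch of the preset-structure layers and the model body,
# assuming PyTorch; sizes and names are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PresetBlock(nn.Module):
    """One preset-structure layer: two 3*3 convs, two ReLUs, shortcut X."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # 3*3 kernel, 1*1 stride, 64 channels as in the text; padding=1
        # keeps the feature-map size so the shortcut addition is valid.
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.conv1(x))   # first conv followed by its ReLU
        out = self.conv2(out)         # second conv, before its ReLU
        return F.relu(out + x)        # operation X (add block input), then second ReLU

class SpeakerEmbedder(nn.Module):
    """Layers 1-4: residual stack, time averaging, FC layer, L2 normalization."""
    def __init__(self, n_blocks: int = 9, channels: int = 64,
                 n_mfcc: int = 36, emb_dim: int = 512):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)  # lift 1 -> 64 channels
        self.blocks = nn.Sequential(*[PresetBlock(channels) for _ in range(n_blocks)])
        self.fc = nn.Linear(channels * n_mfcc, emb_dim)   # DNN fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, time) MFCC features
        h = self.blocks(self.stem(x))        # first layer structure
        h = h.mean(dim=3)                    # averaging layer: mean along the time axis
        h = self.fc(h.flatten(start_dim=1))  # fully connected layer
        return F.normalize(h, p=2, dim=1)    # L2 normalization: length-1 vectors
```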
  • S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the corresponding speaker identity;
  • Each voice data sample is voice data of a known speaker identity. Among the voice data samples, each speaker identity (or some of the speaker identities) corresponds to a plurality of voice data samples, and each voice data sample is labeled with a label representing the corresponding speaker identity.
  • Non-speaker audio (for example, silence or noise) in each voice data sample is removed to obtain standard voice data samples.
  • A first percentage of the obtained standard voice data samples is used as a training set and a second percentage as a verification set, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • For example, 70% of the obtained standard voice data samples are used as the training set and 30% as the verification set.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds;
  • The preset filter is, for example, a Mel filter, and the preset-type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstral Coefficient) spectral features, for example 36-dimensional MFCC spectral features. A sketch of this extraction step follows below.
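  • The following is a minimal sketch of the framing and MFCC extraction step, assuming the librosa library and 16 kHz audio; the file name and sample rate are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of framing + 36-dimensional MFCC extraction, assuming
# librosa; the file name and 16 kHz sample rate are illustrative only.
import librosa

y, sr = librosa.load("current_voice.wav", sr=16000)  # mono waveform

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=36,                  # 36-dimensional MFCC features per frame
    n_fft=int(0.025 * sr),      # 25 ms frame length
    hop_length=int(0.010 * sr), # 10 ms frame shift
)
print(mfcc.shape)  # (36, number_of_frames)
```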
  • The preset-type acoustic features corresponding to the voice frame groups in the training set are divided into M batches and input into the deep neural network model batch by batch for iterative training; after the training of the deep neural network model is completed, the accuracy of the deep neural network model is verified using the verification set;
  • The preset-type acoustic features in the training set are processed in batches, being divided into M (for example, 30) batches.
  • Batches can be assigned by voice frame group, with an equal or unequal number of voice frame groups (and their corresponding preset-type acoustic features) allocated to each batch.
  • The preset-type acoustic features corresponding to the voice frame groups in the training set are input into the preset-structure deep neural network model batch by batch for iterative training, and each iteration updates the model parameters to obtain new model parameters.
  • The verification set is used to verify the accuracy of the deep neural network model. That is, the standard voice data samples in the verification set are grouped in pairs, and each time the preset-type acoustic features corresponding to the standard voice data samples in one pair are input into the deep neural network model.
  • Whether the output verification result is correct is confirmed according to the identity labels of the two input standard voice data samples. After the verification of all pairs is completed, the accuracy rate is calculated from the number of correct verification results; for example, if 100 pairs are verified and the final results are correct for 99 pairs, the accuracy rate is 99%.
  • An accuracy verification threshold (i.e., the preset threshold, for example 98.5%) is preset in the system for checking the training effect of the deep neural network model. If the accuracy obtained by verifying the deep neural network model with the verification set is greater than the preset threshold, the training of the deep neural network model has reached the standard, and model training ends.
  • If the accuracy obtained by verifying the deep neural network model with the verification set is less than or equal to the preset threshold, the training of the deep neural network model has not reached the expected standard, possibly because the training set or the verification set is not large enough. In this case, the number of acquired voice data samples is increased (for example, by a fixed number or by a random number each time), and the above steps S1-S5 are re-executed on this basis. The loop is executed in this way until the requirement of step S6 is met, and model training ends.
  • The iterative training of the deep neural network model minimizes the loss L = Σ_{i=1}^{N} max(0, cos(x_i1, x_i3) - cos(x_i1, x_i2) + α) over triplets, where:
  • The i-th triplet (x_i1, x_i2, x_i3) is composed of three different feature vectors x_i1, x_i2, and x_i3, in which x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
  • α is a constant ranging from 0.05 to 0.2;
  • N is the number of triplets obtained.
  • The model parameter updating step is: (1) calculating the gradients of the deep neural network using the backpropagation algorithm; (2) updating the parameters of the deep neural network using the mini-batch SGD (i.e., mini-batch stochastic gradient descent) method, as sketched below.
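  • The following is a minimal sketch of one parameter update, assuming PyTorch and reusing the SpeakerEmbedder class sketched earlier; the triplet tensors are random placeholders standing in for batched MFCC features, and the batch size and learning rate are illustrative assumptions.

```python
# Minimal sketch of steps (1)-(2): backpropagation + mini-batch SGD over a
# batch of triplets, assuming PyTorch; values below are illustrative only.
import torch
import torch.nn.functional as F

model = SpeakerEmbedder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # mini-batch SGD
alpha = 0.1  # margin constant, within the 0.05-0.2 range given above

# A mini-batch of N=32 triplets: (anchor, same speaker, different speaker).
x1, x2, x3 = (torch.randn(32, 1, 36, 100) for _ in range(3))

optimizer.zero_grad()
a, p, n = model(x1), model(x2), model(x3)
pos = F.cosine_similarity(a, p, dim=1)  # cos(x_i1, x_i2): same speaker
neg = F.cosine_similarity(a, n, dim=1)  # cos(x_i1, x_i3): different speakers
loss = torch.clamp(neg - pos + alpha, min=0).sum()  # triplet hinge loss L
loss.backward()    # (1) backpropagation computes the gradients
optimizer.step()   # (2) mini-batch SGD updates the parameters
```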
  • Based on the deep neural network model described in any of the above embodiments, the present application also proposes an identity verification method.
  • FIG. 3 is a schematic flowchart of an embodiment of an identity verification method according to the present application.
  • The identity verification method includes:
  • Step S10: after the current voice data of the target user whose identity is to be verified is received, obtaining the standard voice data corresponding to the identity to be verified from the database, and performing framing processing on the current voice data and the standard voice data respectively according to preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity in a database. After receiving the current voice data of the target user whose identity is to be verified, the system obtains from the database the standard voice data corresponding to the identity claimed by the target user (the identity to be verified), and then performs framing processing on the received current voice data and the obtained standard voice data respectively according to the preset framing parameters. This yields the current voice frame group corresponding to the current voice data, which includes a plurality of voice frames obtained by dividing the current voice data, and the standard voice frame group corresponding to the standard voice data, which includes a plurality of voice frames obtained by dividing the standard voice data.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds.
  • Step S20: extracting, using a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and the preset-type acoustic features of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the two groups using a preset filter, extracting the preset-type acoustic features of each voice frame in the current voice frame group and in the standard voice frame group.
  • For example, the preset filter is a Mel filter, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) spectral features.
  • Step S30: inputting the extracted preset-type acoustic features corresponding to the current voice frame group and the preset-type acoustic features corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, to obtain a preset-length feature vector corresponding to each of the current voice data and the standard voice data, where the preset-structure deep neural network model is the deep neural network model described in the foregoing embodiments;
  • Step S40: calculating the cosine similarity of the two feature vectors, and determining an identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.
  • The identity verification system contains a pre-trained preset-structure deep neural network model, which is obtained by iterative training on the corresponding preset-type acoustic features of sample voice data. After extracting the preset-type acoustic features of the current voice frame group and of the standard voice frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, and the model converts them into feature vectors of preset length (for example, feature vectors of length 1). The system then calculates the cosine similarity of the two feature vectors and determines the identity verification result according to the calculated magnitude of the cosine similarity, that is, the cosine similarity is compared with a preset threshold (for example, 0.95): if the cosine similarity is greater than the preset threshold, the verification passes, and otherwise the verification fails. A sketch of this decision step follows below.
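  • The following is a minimal sketch of step S40, assuming NumPy; the 0.95 threshold follows the example given above, and the function names and input vectors are illustrative assumptions.

```python
# Minimal sketch of the cosine-similarity decision in step S40, assuming
# NumPy; names, the threshold default, and the example inputs are
# illustrative only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(current_vec: np.ndarray, standard_vec: np.ndarray,
           threshold: float = 0.95) -> bool:
    # Verification passes when the two voiceprint embeddings are similar
    # enough; for L2-normalized vectors the dot product alone already
    # equals the cosine similarity.
    return cosine_similarity(current_vec, standard_vec) > threshold

# Example: identical embeddings score 1.0, so verification passes.
print(verify(np.array([0.6, 0.8]), np.array([0.6, 0.8])))  # True
```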
  • In this way, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed, and a preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing processing. The extracted preset-type acoustic features are input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and the preset-type acoustic features corresponding to the standard voice data into corresponding feature vectors. The cosine similarity of the two feature vectors is calculated, and the verification result is confirmed according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset-type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short. The deep neural network model of the present application is then used to process the extracted acoustic features, which significantly enhances the model's ability to extract features from the input data, reduces the risk of performance degradation as the network depth increases, and improves the accuracy of the output verification results.
  • Further, before the framing processing is performed on the current voice data and the standard voice data according to the preset framing parameters, the method further includes the following step:
  • Active endpoint detection is performed on the current voice data and the standard voice data, respectively, and the non-speaker voice in the current voice data and the standard voice data is deleted.
  • The non-speaker voice parts (for example, silence or noise) are detected and deleted using VAD (voice activity detection), as sketched below.
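  • The following is a minimal sketch of active endpoint detection, assuming the third-party py-webrtcvad package as one possible VAD implementation (the patent does not name a specific library); the input is assumed to be 16-bit mono PCM at 16 kHz.

```python
# Minimal sketch of VAD-based removal of silence/noise frames, assuming
# py-webrtcvad; parameters and names are illustrative only.
import webrtcvad

def drop_non_speech(pcm: bytes, sample_rate: int = 16000,
                    frame_ms: int = 30, aggressiveness: int = 2) -> bytes:
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    kept = []
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if vad.is_speech(frame, sample_rate):  # keep only speaker speech
            kept.append(frame)
    return b"".join(kept)  # silence and noise frames are deleted
```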
  • The present application also proposes an identity verification system.
  • FIG. 4 is a schematic diagram of an operating environment of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 is installed in and runs on the electronic device 1.
  • The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a server.
  • The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • Figure 4 shows only the electronic device 1 with components 11-13, but it should be understood that not all illustrated components may be implemented, and more or fewer components may be implemented instead.
  • The memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk or internal memory of the electronic device 1.
  • The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • The memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1.
  • The memory 11 is used to store application software installed in the electronic device 1 and various types of data, such as the program code of the identity verification system 10.
  • The memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • The processor 12, in some embodiments, may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip for running the program code or processing the data stored in the memory 11, for example executing the identity verification system 10.
  • The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like in some embodiments.
  • The display 13 is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • The components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • FIG. 5 is a program module diagram of a preferred embodiment of the identity verification system 10 of the present application.
  • The identity verification system 10 can be partitioned into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
  • For example, the identity verification system 10 can be partitioned into a framing module 101, an extraction module 102, a calculation module 103, and a result determination module 104.
  • A module referred to in this application is a series of computer program instruction segments capable of performing a specific function, and is more suitable than a whole program for describing the execution process of the identity verification system 10 in the electronic device 1, wherein:
  • The framing module 101 is configured to: after the current voice data of the target user whose identity is to be verified is received, obtain the standard voice data corresponding to the identity to be verified from the database, and perform framing processing on the current voice data and the standard voice data respectively according to the preset framing parameters, to obtain a current voice frame group corresponding to the current voice data and a standard voice frame group corresponding to the standard voice data;
  • The identity verification system pre-stores the standard voice data of each identity in a database. After receiving the current voice data of the target user whose identity is to be verified, the system obtains from the database the standard voice data corresponding to the identity claimed by the target user (the identity to be verified), and then performs framing processing on the received current voice data and the obtained standard voice data respectively according to the preset framing parameters. This yields the current voice frame group corresponding to the current voice data, which includes a plurality of voice frames obtained by dividing the current voice data, and the standard voice frame group corresponding to the standard voice data, which includes a plurality of voice frames obtained by dividing the standard voice data.
  • The preset framing parameters are, for example, a frame length of 25 milliseconds with a frame shift of 10 milliseconds.
  • The extraction module 102 is configured to separately extract, using a preset filter, the preset-type acoustic features of each voice frame in the current voice frame group and the preset-type acoustic features of each voice frame in the standard voice frame group;
  • After obtaining the current voice frame group and the standard voice frame group, the identity verification system performs feature extraction on each voice frame in the two groups using a preset filter, extracting the preset-type acoustic features of each voice frame in the current voice frame group and in the standard voice frame group.
  • For example, the preset filter is a Mel filter, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstral Coefficient) spectral features.
  • The calculation module 103 is configured to input the extracted preset-type acoustic features corresponding to the current voice frame group and the preset-type acoustic features corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, to obtain a preset-length feature vector corresponding to each of the current voice data and the standard voice data, where the preset-structure deep neural network model is the deep neural network model described in the foregoing embodiments;
  • The result determination module 104 is configured to calculate the cosine similarity of the two obtained feature vectors and determine an identity verification result according to the calculated cosine similarity, where the identity verification result includes a verification pass result and a verification failure result.
  • The identity verification system contains a pre-trained preset-structure deep neural network model, which is obtained by iterative training on the corresponding preset-type acoustic features of sample voice data. After extracting the preset-type acoustic features of the current voice frame group and of the standard voice frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current voice frame group and those corresponding to the standard voice frame group into the pre-trained preset-structure deep neural network model, and the model converts them into feature vectors of preset length (for example, feature vectors of length 1). The system then calculates the cosine similarity of the two feature vectors and determines the identity verification result according to the calculated magnitude of the cosine similarity, that is, the cosine similarity is compared with a preset threshold (for example, 0.95): if the cosine similarity is greater than the preset threshold, the verification passes, and otherwise the verification fails.
  • In this way, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed, and a preset filter is used to extract the preset-type acoustic features of each voice frame obtained by the framing processing. The extracted preset-type acoustic features are input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and the preset-type acoustic features corresponding to the standard voice data into corresponding feature vectors. The cosine similarity of the two feature vectors is calculated, and the verification result is confirmed according to the magnitude of the cosine similarity.
  • Because the voice data is first framed into a plurality of voice frames and the preset-type acoustic features are extracted frame by frame, enough acoustic features can be obtained even when the collected valid voice data is short. The deep neural network model of the present application is then used to process the extracted acoustic features, which significantly enhances the model's ability to extract features from the input data, reduces the risk of performance degradation as the network depth increases, and improves the accuracy of the output verification results.
  • FIG. 6 is a program module diagram of a second embodiment of the identity verification system of the present application.
  • The identity verification system further includes:
  • The detection module 105 is configured to perform active endpoint detection on the current voice data and the standard voice data before the framing processing is performed on them according to the preset framing parameters, and to delete the non-speaker voice in the current voice data and the standard voice data.
  • The non-speaker voice parts (for example, silence or noise) are detected and deleted using VAD (voice activity detection).
  • The present application further provides a computer-readable storage medium storing an identity verification system, where the identity verification system is executable by at least one processor to cause the at least one processor to execute the identity verification method in any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The present application relates to a deep neural network model, an electronic device, an identity authentication method, and a storage medium. The method comprises the steps of: after current voice data of a target user whose identity is to be authenticated is received, obtaining standard voice data corresponding to the identity to be authenticated, and performing framing processing on the current voice data and the standard voice data according to preset framing parameters to obtain a current voice frame group and a standard voice frame group (S10); using a preset filter to separately extract preset types of acoustic features from each voice frame in the two voice frame groups (S20); inputting the extracted preset types of acoustic features into a pre-trained deep neural network model with a preset structure to obtain feature vectors of preset length corresponding respectively to the current voice data and the standard voice data (S30); and calculating the cosine similarity of the two obtained feature vectors, and determining an identity authentication result according to the calculated cosine similarity (S40). The method improves the accuracy of identity authentication for speakers.
PCT/CN2018/102218 2018-03-19 2018-08-24 Deep neural network model, electronic device, identity authentication method and storage medium WO2019179036A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810225142.6A CN108564954B (zh) 2018-03-19 2018-03-19 Deep neural network model, electronic device, identity verification method and storage medium
CN201810225142.6 2018-03-19

Publications (1)

Publication Number Publication Date
WO2019179036A1 (fr) 2019-09-26

Family

ID=63531700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102218 WO2019179036A1 (fr) 2018-03-19 2018-08-24 Deep neural network model, electronic device, identity authentication method and storage medium

Country Status (2)

Country Link
CN (1) CN108564954B (fr)
WO (1) WO2019179036A1 (fr)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164452B (zh) * 2018-10-10 2023-03-10 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method, and server
CN109473105A (zh) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 Text-independent voiceprint verification method, apparatus, and computer device
CN109408626B (zh) * 2018-11-09 2021-09-21 思必驰科技股份有限公司 Method and apparatus for processing natural language
CN109243466A (zh) * 2018-11-12 2019-01-18 成都傅立叶电子科技有限公司 Voiceprint authentication training method and system
CN109903774A (zh) * 2019-04-12 2019-06-18 南京大学 Voiceprint recognition method based on an angular margin loss function
CN110148402A (zh) * 2019-05-07 2019-08-20 平安科技(深圳)有限公司 Speech processing method and apparatus, computer device, and storage medium
CN110265065B (zh) * 2019-05-13 2021-08-03 厦门亿联网络技术股份有限公司 Method for constructing a voice endpoint detection model, and voice endpoint detection system
CN110197657B (zh) * 2019-05-22 2022-03-11 大连海事大学 Dynamic voice feature extraction method based on cosine similarity
CN110310628B (zh) * 2019-06-27 2022-05-20 百度在线网络技术(北京)有限公司 Wake-up model optimization method, apparatus, device, and storage medium
CN110767239A (zh) * 2019-09-20 2020-02-07 平安科技(深圳)有限公司 Deep-learning-based voiceprint recognition method, apparatus, and device
CN110992940B (zh) * 2019-11-25 2021-06-15 百度在线网络技术(北京)有限公司 Voice interaction method, apparatus, device, and computer-readable storage medium
CN111933153B (zh) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 Method and apparatus for determining voice segmentation points
CN112309365B (zh) * 2020-10-21 2024-05-10 北京大米科技有限公司 Speech synthesis model training method and apparatus, storage medium, and electronic device
CN112071322B (zh) * 2020-10-30 2022-01-25 北京快鱼电子股份公司 End-to-end voiceprint recognition method, apparatus, storage medium, and device
CN113178197B (zh) * 2021-04-27 2024-01-09 平安科技(深圳)有限公司 Voice verification model training method and apparatus, and computer device
CN113705671B (zh) * 2021-08-27 2023-08-29 厦门大学 Speaker recognition method and system based on text-related information perception


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores
CN105261358A (zh) * 2014-07-17 2016-01-20 中国科学院声学研究所 N-gram model construction method for speech recognition, and speech recognition system
CN107610707B (zh) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus
CN107808659A (zh) * 2017-12-02 2018-03-16 宫文峰 Intelligent voice signal pattern recognition system apparatus
CN108564955B (zh) * 2018-03-19 2019-09-03 平安科技(深圳)有限公司 Electronic device, identity verification method, and computer-readable storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107408384A (zh) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 Deployed end-to-end speech recognition
CN106328122A (zh) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Speech recognition method using a long short-term memory recurrent neural network
CN106340309A (zh) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Deep-learning-based dog bark emotion recognition method and apparatus
CN106782602A (zh) * 2016-12-01 2017-05-31 南京邮电大学 Speech emotion recognition method based on long short-term memory networks and convolutional neural networks
CN106816147A (zh) * 2017-01-25 2017-06-09 上海交通大学 Speech recognition system based on a binary neural network acoustic model
CN106920544A (zh) * 2017-03-17 2017-07-04 深圳市唯特视科技有限公司 Speech recognition method based on deep neural network feature training
CN106991999A (zh) * 2017-03-29 2017-07-28 北京小米移动软件有限公司 Speech recognition method and device
CN106952649A (zh) * 2017-05-14 2017-07-14 北京工业大学 Speaker recognition method based on convolutional neural networks and spectrograms
CN107527620A (zh) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic device, identity verification method, and computer-readable storage medium
CN107705806A (zh) * 2017-08-22 2018-02-16 北京联合大学 Method for speech emotion recognition using spectrograms and deep convolutional neural networks
CN108461085A (zh) * 2018-03-13 2018-08-28 南京邮电大学 Speaker recognition method under short-duration speech conditions

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment
CN111402130A (zh) * 2020-02-21 2020-07-10 华为技术有限公司 Data processing method and data processing apparatus
CN113160850A (zh) * 2021-04-27 2021-07-23 广州国音智能科技有限公司 Audio feature extraction method and apparatus based on a reparameterization-based decoupling approach
CN118380098A (zh) * 2024-06-21 2024-07-23 绵阳市第三人民医院 Postoperative care plan generation method and system

Also Published As

Publication number Publication date
CN108564954A (zh) 2018-09-21
CN108564954B (zh) 2020-01-10

Similar Documents

Publication Publication Date Title
WO2019179036A1 (fr) Deep neural network model, electronic device, identity authentication method and storage medium
WO2019179029A1 (fr) Electronic device, identity verification method, and computer-readable storage medium
WO2019119505A1 (fr) Facial recognition method and device, computer device, and storage medium
US11055395B2 (en) Step-up authentication
WO2017215558A1 (fr) Voiceprint recognition method and device
US10685008B1 (en) Feature embeddings with relative locality for fast profiling of users on streaming data
US20140343943A1 (en) Systems, Computer Medium and Computer-Implemented Methods for Authenticating Users Using Voice Streams
WO2019174131A1 (fr) Identity authentication method, server, and computer-readable storage medium
WO2019196303A1 (fr) User identity authentication method, server, and storage medium
US11062120B2 (en) High speed reference point independent database filtering for fingerprint identification
US20140007210A1 (en) High security biometric authentication system
WO2019136911A1 (fr) Speech recognition method and apparatus, terminal device, and storage medium
US9483682B1 (en) Fingerprint recognition method and device thereof
US10089349B2 (en) Method and electronic device for updating the registered fingerprint datasets of fingerprint recognition
WO2022142032A1 (fr) Handwritten signature verification method and apparatus, computer device, and storage medium
US20240187406A1 (en) Context-based authentication of a user
US12069047B2 (en) Using an enrolled biometric dataset to detect adversarial examples in biometrics-based authentication system
WO2017156963A1 (fr) Terminal and fingerprint unlocking method therefor
WO2020024415A1 (fr) Voiceprint recognition processing method and apparatus, electronic device, and storage medium
CN116561737A (zh) Password validity detection method based on a user behavior baseline, and related device
WO2023134232A1 (fr) Feature vector database update method, apparatus, device, and medium
US11582336B1 (en) System and method for gender based authentication of a caller
CN110457877B (zh) User authentication method and apparatus, electronic device, and computer-readable storage medium
WO2021257000A1 (fr) Cross-modal speaker verification
WO2020191547A1 (fr) Biometric recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18910500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18910500

Country of ref document: EP

Kind code of ref document: A1