CN108564955A - Electronic device, identity verification method and computer-readable storage medium - Google Patents

Electronic device, identity verification method and computer-readable storage medium

Info

Publication number
CN108564955A
CN108564955A (application CN201810225887.2A)
Authority
CN
China
Prior art keywords
voice data
preset
layer
neural network
frame group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810225887.2A
Other languages
Chinese (zh)
Other versions
CN108564955B (en)
Inventor
赵峰
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810225887.2A
Priority to PCT/CN2018/102105
Publication of CN108564955A
Application granted
Publication of CN108564955B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The present invention discloses an electronic device, an identity verification method and a storage medium. The method comprises: after receiving the current voice data of a target user pending identity verification, obtaining the standard voice data corresponding to the identity to be verified, and framing the two pieces of voice data separately to obtain a current speech frame group and a standard speech frame group; using a preset filter to extract the preset-type acoustic features of each speech frame in the two speech frame groups; inputting the extracted preset-type acoustic features into a pre-trained deep neural network model of preset structure, to obtain feature vectors of preset length corresponding to the current voice data and the standard voice data; calculating the cosine similarity of the two feature vectors, and determining the identity verification result according to the magnitude of the calculated cosine similarity. The technical solution of the present invention improves the accuracy of speaker identity verification.

Description

Electronic device, identity verification method and computer-readable storage medium
Technical field
The present invention relates to the technical field of voiceprint recognition, and more particularly to an electronic device, an identity verification method and a computer-readable storage medium.
Background technology
Speaker recognition, commonly referred to as voiceprint recognition, is a kind of biometric identification technology. It is often used to confirm whether a certain segment of speech was spoken by a specified person, which is a "one-to-one discrimination" problem. Speaker recognition is widely applied in numerous areas, for example finance, securities, social security, public security, the military and other civil security authentication fields.
Speaker recognition comes in two forms: text-dependent and text-independent. In recent years text-independent speaker recognition technology has achieved continuous breakthroughs, and its accuracy has improved greatly compared with the past. However, in certain constrained situations, for example when the collected effective speech of the speaker is short (less than 5 seconds in duration), the accuracy of existing text-independent speaker recognition technology is not high and errors occur easily.
Summary of the invention
The main object of the present invention is to provide an electronic device, an identity verification method and a computer-readable storage medium, intended to improve the accuracy of speaker identity verification.
To achieve the above object, the present invention proposes an electronic device comprising a memory and a processor, the memory storing an identity verification system runnable on the processor. When executed by the processor, the identity verification system implements the following steps:
After receiving the current voice data of a target user pending identity verification, obtain the standard voice data corresponding to the identity to be verified from a database, and frame the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data;
Use a preset filter to extract the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group;
Input the preset-type acoustic features corresponding to the current speech frame group and the preset-type acoustic features corresponding to the standard speech frame group, respectively, into a pre-trained deep neural network model of preset structure, to obtain feature vectors of preset length corresponding to the current voice data and the standard voice data;
Calculate the cosine similarity of the two obtained feature vectors, and determine the identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
Preferably, before the step of framing the current voice data and the standard voice data separately according to the preset framing parameters, the processor is further configured to execute the identity verification system to implement the following step:
Perform voice activity detection on the current voice data and the standard voice data respectively, and delete the non-speaker speech in the current voice data and the standard voice data.
Preferably, the training process of the preset-structure deep neural network model is:
S1. Obtain a preset number of voice data samples, and mark each voice data sample with a label representing the corresponding speaker identity;
S2. Perform voice activity detection on each voice data sample respectively, delete the non-speaker speech in the voice data samples, and obtain a preset number of standard voice data samples;
S3. Take a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, the sum of the first percentage and the second percentage being less than or equal to 100%;
S4. Frame each standard voice data sample in the training set and the verification set separately according to the preset framing parameters, to obtain the speech frame group corresponding to each standard voice data sample, and then use the preset filter to extract the preset-type acoustic features of each speech frame in each speech frame group;
S5. Divide the preset-type acoustic features corresponding to the speech frame groups in the training set into M batches, input them batch by batch into the preset-structure deep neural network model for iterative training, and after the training of the preset-structure deep neural network model is completed, verify the accuracy rate of the preset-structure deep neural network model using the verification set;
S6. If the accuracy rate obtained by verification is greater than a preset threshold, the model training ends;
S7. If the accuracy rate obtained by verification is less than or equal to the preset threshold, increase the number of voice data samples obtained, and re-execute the above steps S1-S5 based on the increased voice data samples.
Preferably, the iterative training process of the preset-structure deep neural network model includes:
Convert the preset-type acoustic features corresponding to each currently input speech frame group into a feature vector of the corresponding preset length according to the current parameters of the model;
Randomly sample the feature vectors to obtain multiple triples, the i-th triple (x_i1, x_i2, x_i3) consisting of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
Calculate the cosine similarity s_i^+ between x_i1 and x_i2 and the cosine similarity s_i^- between x_i1 and x_i3 using a predetermined calculation formula;
Update the parameters of the model according to the cosine similarities s_i^+ and s_i^- and a predetermined loss function L, the formula of the predetermined loss function L being: L = Σ_{i=1}^{N} max(0, s_i^- - s_i^+ + α), where α is a constant with value range between 0.05 and 0.2, and N is the number of triples obtained.
Preferably, the network structure of the preset-structure deep neural network model is as follows:
First layer: several stacked neural network layers with identical structure, where each neural network layer uses a forward long short-term memory network (LSTM) and a backward LSTM side by side; the number of layers is 1 to 3;
Second layer: an average layer; this layer averages vector sequences along the time axis, averaging all the vector sequences output by the forward LSTM and backward LSTM of the previous layer to obtain one forward average vector and one backward average vector, and concatenating the two average vectors into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer; this layer normalizes the input of the previous layer according to the L2 norm, obtaining a normalized feature vector of length 1;
Fifth layer: a loss layer; the formula of the loss function L is: L = Σ_{i=1}^{N} max(0, s_i^- - s_i^+ + α), where α is a constant with value range between 0.05 and 0.2, s_i^+ represents the cosine similarity of two feature vectors belonging to the same speaker, and s_i^- represents the cosine similarity of two feature vectors not belonging to the same speaker.
The present invention also proposes an identity verification method, which includes:
After receiving the current voice data of a target user pending identity verification, obtain the standard voice data corresponding to the identity to be verified from a database, and frame the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data;
Use a preset filter to extract the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group;
Input the preset-type acoustic features corresponding to the current speech frame group and the preset-type acoustic features corresponding to the standard speech frame group, respectively, into a pre-trained deep neural network model of preset structure, to obtain feature vectors of preset length corresponding to the current voice data and the standard voice data;
Calculate the cosine similarity of the two obtained feature vectors, and determine the identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
Preferably, before the step of framing the current voice data and the standard voice data separately according to the preset framing parameters, the identity verification method further includes the step:
Perform voice activity detection on the current voice data and the standard voice data respectively, and delete the non-speaker speech in the current voice data and the standard voice data.
Preferably, the training process of the preset-structure deep neural network model is:
S1. Obtain a preset number of voice data samples, and mark each voice data sample with a label representing the corresponding speaker identity;
S2. Perform voice activity detection on each voice data sample respectively, delete the non-speaker speech in the voice data samples, and obtain a preset number of standard voice data samples;
S3. Take a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, the sum of the first percentage and the second percentage being less than or equal to 100%;
S4. Frame each standard voice data sample in the training set and the verification set separately according to the preset framing parameters, to obtain the speech frame group corresponding to each standard voice data sample, and then use the preset filter to extract the preset-type acoustic features of each speech frame in each speech frame group;
S5. Divide the preset-type acoustic features corresponding to the speech frame groups in the training set into M batches, input them batch by batch into the preset-structure deep neural network model for iterative training, and after the training of the preset-structure deep neural network model is completed, verify the accuracy rate of the preset-structure deep neural network model using the verification set;
S6. If the accuracy rate obtained by verification is greater than a preset threshold, the model training ends;
S7. If the accuracy rate obtained by verification is less than or equal to the preset threshold, increase the number of voice data samples obtained, and re-execute the above steps S1-S5 based on the increased voice data samples.
Preferably, the network structure of the preset-structure deep neural network model is as follows:
First layer: several stacked neural network layers with identical structure, where each neural network layer uses a forward long short-term memory network (LSTM) and a backward LSTM side by side; the number of layers is 1 to 3;
Second layer: an average layer; this layer averages vector sequences along the time axis, averaging all the vector sequences output by the forward LSTM and backward LSTM of the previous layer to obtain one forward average vector and one backward average vector, and concatenating the two average vectors into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer; this layer normalizes the input of the previous layer according to the L2 norm, obtaining a normalized feature vector of length 1;
Fifth layer: a loss layer; the formula of the loss function L is: L = Σ_{i=1}^{N} max(0, s_i^- - s_i^+ + α), where α is a constant with value range between 0.05 and 0.2, s_i^+ represents the cosine similarity of two feature vectors belonging to the same speaker, and s_i^- represents the cosine similarity of two feature vectors not belonging to the same speaker.
The present invention also proposes a computer-readable storage medium storing an identity verification system; the identity verification system can be executed by at least one processor, so that the at least one processor executes the following steps:
After receiving the current voice data of a target user pending identity verification, obtain the standard voice data corresponding to the identity to be verified from a database, and frame the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data;
Use a preset filter to extract the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group;
Input the preset-type acoustic features corresponding to the current speech frame group and the preset-type acoustic features corresponding to the standard speech frame group, respectively, into a pre-trained deep neural network model of preset structure, to obtain feature vectors of preset length corresponding to the current voice data and the standard voice data; calculate the cosine similarity of the two obtained feature vectors, and determine the identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
In the technical solution of the present invention, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset-type acoustic features of each speech frame obtained by framing are extracted using a preset filter; the extracted preset-type acoustic features are then input into a pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors, calculates the cosine similarity of the two feature vectors, and confirms the verification result according to the magnitude of the cosine similarity. By first framing the voice data into multiple speech frames and extracting preset-type acoustic features frame by frame, this technical solution can extract enough acoustic features from the collected voice data even when the collected effective voice data is very short, and then processes the extracted acoustic features through the trained deep neural network model to output the verification result. Compared with the prior art, this scheme verifies speaker identity with higher accuracy and reliability.
Description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
Fig. 1 is a flow diagram of an embodiment of the identity verification method of the present invention;
Fig. 2 is a flow diagram of the training process of the preset-structure deep neural network model of the present invention;
Fig. 3 is a schematic diagram of the running environment of an embodiment of the identity verification system of the present invention;
Fig. 4 is a program module diagram of a first embodiment of the identity verification system of the present invention;
Fig. 5 is a program module diagram of a second embodiment of the identity verification system of the present invention.
The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings; the given examples serve only to explain the present invention and are not used to limit the scope of the present invention.
As shown in Fig. 1, Fig. 1 is a flow diagram of an embodiment of the identity verification method of the present invention.
In the present embodiment, the identity verification method includes:
Step S10: after receiving the current voice data of a target user pending identity verification, obtain the standard voice data corresponding to the identity to be verified from a database, and frame the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data.
The database of the identity verification system stores the standard voice data of each identity in advance. After receiving the current voice data of the target user pending identity verification, according to the identity that the target user requests to verify (the identity to be verified), the identity verification system obtains the standard voice data corresponding to the identity to be verified from the database, and then frames the received current voice data and the obtained standard voice data separately according to the preset framing parameters, to obtain the current speech frame group corresponding to the current voice data (comprising the multiple speech frames obtained by framing the current voice data) and the standard speech frame group corresponding to the standard voice data (comprising the multiple speech frames obtained by framing the standard voice data). The preset framing parameters are, for example, framing every 25 milliseconds with a frame shift of 10 milliseconds.
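For concreteness, the sketch below implements this framing step in Python under the example parameters above (25 ms frames, 10 ms frame shift); the 16 kHz sampling rate and the function name are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def split_into_frames(samples: np.ndarray, sample_rate: int = 16000,
                      frame_ms: int = 25, shift_ms: int = 10) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames (a 'speech frame group')."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(samples) - frame_len) // frame_shift)
    return np.stack([samples[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])       # shape: (n_frames, frame_len)

frames = split_into_frames(np.random.randn(48000))    # ~3 s of audio -> 298 frames
```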
Step S20: use a preset filter to extract the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group.
After obtaining the current speech frame group and the standard speech frame group, the identity verification system uses the preset filter to perform feature extraction on each speech frame in the current speech frame group and the standard speech frame group respectively, extracting the preset-type acoustic features corresponding to each speech frame in the current speech frame group and those corresponding to each speech frame in the standard speech frame group. For example, the preset filter is a Mel filter, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral features.
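A minimal sketch of this feature extraction step, assuming the librosa library as the Mel-filter implementation (the patent does not name a library) and the 36-dimensional MFCC setting from the example:

```python
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 36):
    """Extract per-frame MFCC spectral features with a Mel filter bank."""
    y, sr = librosa.load(wav_path, sr=16000)  # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr))  # 10 ms frame shift
    return mfcc.T  # shape: (n_frames, 36), one feature row per speech frame
```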
Step S30: input the preset-type acoustic features corresponding to the current speech frame group and the preset-type acoustic features corresponding to the standard speech frame group, respectively, into a pre-trained deep neural network model of preset structure, to obtain feature vectors of preset length corresponding to the current voice data and the standard voice data.
Step S40: calculate the cosine similarity of the two obtained feature vectors, and determine the identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
The identity verification system holds a pre-trained preset-structure deep neural network model, which is a model iteratively trained on the preset-type acoustic features corresponding to sample voice data. After performing feature extraction on the speech frames in the current speech frame group and the standard speech frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current speech frame group and those corresponding to the standard speech frame group into the pre-trained preset-structure deep neural network model. The model converts each set of preset-type acoustic features into a feature vector of preset length (for example, a feature vector of length 1), after which the cosine similarity of the two obtained feature vectors is calculated and the identity verification result is determined according to its magnitude: the cosine similarity is compared with a preset threshold (e.g., 0.95); if the cosine similarity is greater than the preset threshold, it is determined that the identity verification passes, otherwise it is determined that the identity verification fails. The cosine similarity calculation formula is: cos(x_i, x_j) = x_i^T x_j, where x_i and x_j represent the two feature vectors and T denotes the transpose; since both feature vectors are normalized to length 1, this inner product equals their cosine similarity.
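As a sketch of the decision rule in step S40 (the 0.95 threshold is the example value above; the embedding vectors are assumed to be the model's L2-normalized outputs):

```python
import numpy as np

def verify_identity(current_vec: np.ndarray, standard_vec: np.ndarray,
                    threshold: float = 0.95) -> bool:
    """Pass identity verification iff the cosine similarity exceeds the threshold."""
    # Both vectors leave the model's normalization layer with unit L2 norm,
    # so the inner product x_i^T x_j is already the cosine similarity.
    return float(np.dot(current_vec, standard_vec)) > threshold
```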
In the technical solution of this embodiment, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset-type acoustic features of each speech frame obtained by framing are extracted using the preset filter; the extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors, calculates the cosine similarity of the two feature vectors, and confirms the verification result according to the magnitude of the cosine similarity. By first framing the voice data into multiple speech frames and extracting preset-type acoustic features frame by frame, this solution can extract enough acoustic features from the collected voice data even when the collected effective voice data is very short, and then processes the extracted acoustic features through the trained deep neural network model to output the verification result. Compared with the prior art, this scheme verifies speaker identity with higher accuracy and reliability.
Further, in this embodiment, before the step of framing the current voice data and the standard voice data separately according to the preset framing parameters, the identity verification method further includes the step:
Perform voice activity detection on the current voice data and the standard voice data respectively, and delete the non-speaker speech in the current voice data and the standard voice data.
Both the collected current voice data and the pre-stored standard voice data include some non-speaker speech portions (for example, silence or noise). If these portions are not deleted, the speech frame groups obtained after framing the current voice data or the standard voice data will contain speech frames that include non-speaker speech portions (some speech frames may even consist entirely of non-speaker speech). The preset-type acoustic features extracted from such speech frames with the preset filter are impurity features and reduce the accuracy of the results of the preset-structure deep neural network model. Therefore, before framing the voice data, this embodiment first detects the non-speaker speech portions in the current voice data and the standard voice data and deletes them. The detection method used in this embodiment for non-speaker speech portions is voice activity detection (VAD).
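The patent names VAD but not a specific algorithm; as a hedged illustration, the sketch below drops low-energy frames with a simple energy threshold, a crude stand-in for a production VAD:

```python
import numpy as np

def energy_vad(frames: np.ndarray, ratio: float = 0.1) -> np.ndarray:
    """Keep only frames whose energy exceeds a fraction of the loudest frame,
    discarding silence and low-level noise (an assumed, simplistic criterion)."""
    energies = (frames.astype(np.float64) ** 2).sum(axis=1)
    return frames[energies > ratio * energies.max()]
```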
As shown in Fig. 2, in the present embodiment, the training process of the preset-structure deep neural network model is:
S1. Obtain a preset number of voice data samples, and mark each voice data sample with a label representing the corresponding speaker identity.
A preset number (for example, 10,000) of voice data samples are first prepared; each voice data sample is voice data of a known speaker identity. Among these voice data samples, each speaker identity (or part of the speaker identities) corresponds to multiple voice data samples, and each voice data sample is marked with a label representing the corresponding speaker identity.
S2. Perform voice activity detection on each voice data sample respectively, delete the non-speaker speech in the voice data samples, and obtain a preset number of standard voice data samples.
Voice activity detection is performed on the voice data samples to detect and delete the non-speaker speech (for example, silence or noise) in each voice data sample, avoiding voice data unrelated to the voiceprint features of the corresponding speaker identity remaining in the voice data samples and affecting the training effect of the model.
S3. Take a first percentage of the obtained standard voice data samples as a training set and a second percentage as a verification set, the sum of the first percentage and the second percentage being less than or equal to 100%.
For example, 70% of the obtained standard voice data samples are used as the training set and 30% as the verification set.
S4. Frame each standard voice data sample in the training set and the verification set separately according to the preset framing parameters, to obtain the speech frame group corresponding to each standard voice data sample, and then use the preset filter to extract the preset-type acoustic features of each speech frame in each speech frame group.
The preset framing parameters are, for example, framing every 25 milliseconds with a frame shift of 10 milliseconds. The preset filter is, for example, a Mel filter, and the preset-type acoustic features extracted by the Mel filter are MFCC (Mel Frequency Cepstrum Coefficient) spectral features, for example 36-dimensional MFCC spectral features.
S5. Divide the preset-type acoustic features corresponding to the speech frame groups in the training set into M batches, input them batch by batch into the preset-structure deep neural network model for iterative training, and after the training of the preset-structure deep neural network model is completed, verify the accuracy rate of the preset-structure deep neural network model using the verification set.
The preset-type acoustic features in the training set are batched into M (for example, 30) batches; the batching may take the speech frame group as the allocation unit, with each batch assigned the preset-type acoustic features corresponding to an equal or unequal number of speech frame groups. The preset-type acoustic features corresponding to each speech frame group in the training set are input batch by batch into the preset-structure deep neural network model for iterative training; each batch of preset-type acoustic features makes the preset-structure deep neural network model iterate once, and each iteration updates the model parameters, so that after multiple iterations the preset-structure deep neural network model has been updated to good model parameters. After the iterative training is completed, the accuracy rate of the preset-structure deep neural network model is verified using the verification set: the standard voice data in the verification set are grouped in pairs, the preset-type acoustic features corresponding to each standard voice data sample in a group are input into the preset-structure deep neural network model, and whether the output verification result is correct is confirmed according to the identity labels of the two input standard voice data samples. After the verification of each group is completed, the accuracy rate is calculated according to the number of correct verification results; for example, if 100 groups are verified and 99 groups finally yield correct verification results, the accuracy rate is 99%.
S6. If the accuracy rate obtained by verification is greater than a preset threshold, the model training ends.
A verification threshold for the accuracy rate (i.e., the preset threshold, for example 98.5%) is preset in the system to test the training effect of the preset-structure deep neural network model. If the accuracy rate obtained by verifying the preset-structure deep neural network model with the verification set is greater than the preset threshold, the training of the preset-structure deep neural network model has reached the standard, and the model training ends.
S7. If the accuracy rate obtained by verification is less than or equal to the preset threshold, increase the number of voice data samples obtained, and re-execute the above steps S1-S5 based on the increased voice data samples.
If the accuracy rate obtained by verifying the preset-structure deep neural network model with the verification set is less than or equal to the preset threshold, the training of the preset-structure deep neural network model has not yet reached the expected standard, possibly because the training set or the verification set is not large enough. In this case, the number of voice data samples obtained is increased (for example, a fixed quantity or a random quantity is added each time), and the above steps S1-S5 are re-executed on this basis. This cycle is repeated until the requirement of step S6 is reached, at which point the model training ends.
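The S1-S7 control flow can be summarized as the loop below; all callables and the growth amount are placeholders for the steps described above, not APIs from the patent:

```python
def train_until_accurate(collect_samples, build_features, train_model, evaluate,
                         accuracy_threshold: float = 0.985,
                         initial_count: int = 10000, growth: int = 2000):
    """Outer loop of steps S1-S7: enlarge the sample set and retrain until the
    verification-set accuracy exceeds the preset threshold."""
    count = initial_count
    while True:
        samples = collect_samples(count)     # S1-S2: labeled, VAD-cleaned samples
        split = int(0.7 * len(samples))      # S3: e.g. 70% train / 30% verify
        train_set, verify_set = samples[:split], samples[split:]
        model = train_model(build_features(train_set))  # S4-S5: frame, extract, train in M batches
        if evaluate(model, verify_set) > accuracy_threshold:
            return model                     # S6: accuracy reached, stop
        count += growth                      # S7: obtain more samples, repeat
```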
In the present embodiment, the iterative training process of the preset-structure deep neural network model includes:
Convert the preset-type acoustic features corresponding to each currently input speech frame group into a feature vector of the corresponding preset length according to the current parameters of the model;
Randomly sample the feature vectors to obtain multiple triples, the i-th triple (x_i1, x_i2, x_i3) consisting of three different feature vectors x_i1, x_i2 and x_i3, where x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
Calculate the cosine similarity s_i^+ between x_i1 and x_i2 and the cosine similarity s_i^- between x_i1 and x_i3 using a predetermined calculation formula;
Update the parameters of the model according to the cosine similarities s_i^+ and s_i^- and a predetermined loss function L, the formula of the predetermined loss function L being: L = Σ_{i=1}^{N} max(0, s_i^- - s_i^+ + α), where α is a constant with value range between 0.05 and 0.2, and N is the number of triples obtained.
The model parameter update steps are: 1. calculate the gradient of the preset-structure deep neural network using the back-propagation algorithm; 2. update the parameters of the preset-structure deep neural network using the mini-batch SGD (mini-batch stochastic gradient descent) method.
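A minimal PyTorch sketch of one such update, assuming `model` maps a batch of acoustic-feature sequences to embeddings and `optimizer` is a (mini-batch) SGD optimizer; the triple sampling is assumed to have already produced aligned anchor/positive/negative batches:

```python
import torch

def triplet_update(model, optimizer, anchors, positives, negatives, alpha=0.1):
    """One parameter update with the loss L = sum_i max(0, s_i^- - s_i^+ + alpha)."""
    x1 = torch.nn.functional.normalize(model(anchors), dim=1)    # x_i1
    x2 = torch.nn.functional.normalize(model(positives), dim=1)  # x_i2, same speaker
    x3 = torch.nn.functional.normalize(model(negatives), dim=1)  # x_i3, other speaker
    s_pos = (x1 * x2).sum(dim=1)   # s_i^+ : cosine similarity within a speaker
    s_neg = (x1 * x3).sum(dim=1)   # s_i^- : cosine similarity across speakers
    loss = torch.clamp(s_neg - s_pos + alpha, min=0).sum()
    optimizer.zero_grad()
    loss.backward()     # 1. back-propagation computes the gradient
    optimizer.step()    # 2. mini-batch SGD updates the parameters
    return loss.item()
```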
Further, the network structure of the preset-structure deep neural network model of the present embodiment is as follows:
First layer: several stacked neural network layers with identical structure, where each neural network layer uses a forward long short-term memory network (LSTM) and a backward LSTM side by side; the number of layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence;
Second layer: an average layer; this layer averages vector sequences along the time axis, averaging all the vector sequences output by the forward LSTM and backward LSTM of the previous layer to obtain one forward average vector and one backward average vector, and concatenating the two average vectors into a single vector;
Third layer: a fully connected deep neural network (DNN) layer;
Fourth layer: a normalization layer; this layer normalizes the input of the previous layer according to the L2 norm, obtaining a normalized feature vector of length 1;
Fifth layer: a loss layer; the formula of the loss function L is: L = Σ_{i=1}^{N} max(0, s_i^- - s_i^+ + α), where α is a constant with value range between 0.05 and 0.2, s_i^+ represents the cosine similarity of two feature vectors belonging to the same speaker, and s_i^- represents the cosine similarity of two feature vectors not belonging to the same speaker.
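Read literally, the first four layers map naturally onto the PyTorch sketch below (the hidden and embedding sizes are assumptions; the patent fixes only the 36-dimensional input and the 1-3 BiLSTM layers); the fifth, loss, layer is applied only during training, as in the triplet sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbeddingNet(nn.Module):
    """BiLSTM stack -> time average -> fully connected DNN layer -> L2 normalization."""
    def __init__(self, feat_dim: int = 36, hidden: int = 128, embed_dim: int = 256):
        super().__init__()
        # Layer 1: stacked layers, each a forward LSTM and a backward LSTM side by side.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Layer 3: fully connected DNN layer.
        self.fc = nn.Linear(2 * hidden, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)            # x: (batch, n_frames, 36) MFCC sequences
        fwd, bwd = out.chunk(2, dim=-1)  # forward / backward output vector sequences
        # Layer 2: average each sequence along the time axis, then concatenate
        # the forward average vector and the backward average vector.
        pooled = torch.cat([fwd.mean(dim=1), bwd.mean(dim=1)], dim=1)
        # Layer 4: L2-normalize to obtain a feature vector of length 1.
        return F.normalize(self.fc(pooled), p=2, dim=1)
```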
In addition, the present invention also proposes an identity verification system.
Referring to Fig. 3, which is a schematic diagram of the running environment of a preferred embodiment of the identity verification system 10 of the present invention.
In the present embodiment, the identity verification system 10 is installed and runs in an electronic device 1. The electronic device 1 may be a computing device such as a desktop computer, notebook, palmtop computer or server. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a display 13. Fig. 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead.
In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 is used to store the application software installed on the electronic device 1 and all kinds of data, for example the program code of the identity verification system 10. The memory 11 may also be used to temporarily store data that has been output or will be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the identity verification system 10.
In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc. The display 13 is used to display the information processed in the electronic device 1 and to display a visual user interface, such as a business customization interface. The components 11-13 of the electronic device 1 communicate with each other through a system bus.
Referring to Fig. 4, which is a program module diagram of a preferred embodiment of the identity verification system 10 of the present invention. In the present embodiment, the identity verification system 10 may be divided into one or more modules; the one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in the present embodiment) to complete the present invention. For example, in Fig. 4, the identity verification system 10 may be divided into a framing module 101, an extraction module 102, a computing module 103 and a result determining module 104. The modules referred to in the present invention are series of computer program instruction segments capable of completing specific functions, more suitable than a program for describing the execution process of the identity verification system 10 in the electronic device 1, wherein:
The framing module 101 is used to, after receiving the current voice data of a target user pending identity verification, obtain the standard voice data corresponding to the identity to be verified from a database, and frame the current voice data and the standard voice data separately according to preset framing parameters, to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data.
The database of the identity verification system stores the standard voice data of each identity in advance. After receiving the current voice data of the target user pending identity verification, according to the identity that the target user requests to verify (the identity to be verified), the identity verification system obtains the standard voice data corresponding to the identity to be verified from the database, and then frames the received current voice data and the obtained standard voice data separately according to the preset framing parameters, to obtain the current speech frame group corresponding to the current voice data (comprising the multiple speech frames obtained by framing the current voice data) and the standard speech frame group corresponding to the standard voice data (comprising the multiple speech frames obtained by framing the standard voice data). The preset framing parameters are, for example, framing every 25 milliseconds with a frame shift of 10 milliseconds.
The extraction module 102 is used to extract, using a preset filter, the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group.
After obtaining the current speech frame group and the standard speech frame group, the identity verification system uses the preset filter to perform feature extraction on each speech frame in the current speech frame group and the standard speech frame group respectively, extracting the preset-type acoustic features corresponding to each speech frame in the current speech frame group and those corresponding to each speech frame in the standard speech frame group. For example, the preset filter is a Mel filter, and the extracted preset-type acoustic features are 36-dimensional MFCC (Mel Frequency Cepstrum Coefficient) spectral features.
The computing module 103 is used to input the preset-type acoustic features corresponding to the current speech frame group and the preset-type acoustic features corresponding to the standard speech frame group, respectively, into a pre-trained deep neural network model of preset structure, to obtain feature vectors of preset length corresponding to the current voice data and the standard voice data.
The result determining module 104 is used to calculate the cosine similarity of the two obtained feature vectors and determine the identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
The identity verification system holds a pre-trained preset-structure deep neural network model, which is a model iteratively trained on the preset-type acoustic features corresponding to sample voice data. After performing feature extraction on the speech frames in the current speech frame group and the standard speech frame group, the identity verification system inputs the preset-type acoustic features corresponding to the current speech frame group and those corresponding to the standard speech frame group into the pre-trained preset-structure deep neural network model. The model converts each set of preset-type acoustic features into a feature vector of preset length (for example, a feature vector of length 1), after which the cosine similarity of the two obtained feature vectors is calculated and the identity verification result is determined according to its magnitude: the cosine similarity is compared with a preset threshold (e.g., 0.95); if the cosine similarity is greater than the preset threshold, it is determined that the identity verification passes, otherwise it is determined that the identity verification fails. The cosine similarity calculation formula is: cos(x_i, x_j) = x_i^T x_j, where x_i and x_j represent the two feature vectors and T denotes the transpose.
In the technical solution of this embodiment, the received current voice data of the target user whose identity is to be verified and the standard voice data of the identity to be verified are first framed; the preset-type acoustic features of each speech frame obtained by framing are extracted using the preset filter; the extracted preset-type acoustic features are then input into the pre-trained preset-structure deep neural network model, which converts the preset-type acoustic features corresponding to the current voice data and those corresponding to the standard voice data into corresponding feature vectors, calculates the cosine similarity of the two feature vectors, and confirms the verification result according to the magnitude of the cosine similarity. By first framing the voice data into multiple speech frames and extracting preset-type acoustic features frame by frame, this solution can extract enough acoustic features from the collected voice data even when the collected effective voice data is very short, and then processes the extracted acoustic features through the trained deep neural network model to output the verification result. Compared with the prior art, this scheme verifies speaker identity with higher accuracy and reliability.
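To make the module split of Fig. 4 concrete, a skeleton of the four modules wired together is sketched below; the injected callables (framer, extractor, embedder) stand in for modules 101-103 and are hypothetical names, not from the patent:

```python
class IdentityVerificationSystem:
    """Skeleton mirroring framing module 101, extraction module 102,
    computing module 103 and result determining module 104."""
    def __init__(self, database, framer, extractor, embedder, threshold=0.95):
        self.database = database    # identity -> standard voice data
        self.framer = framer        # framing module 101
        self.extractor = extractor  # extraction module 102
        self.embedder = embedder    # computing module 103
        self.threshold = threshold

    def verify(self, identity, current_voice) -> bool:
        """Result determining module 104: compare the two embeddings."""
        standard_voice = self.database[identity]
        cur, std = (self.embedder(self.extractor(self.framer(v)))
                    for v in (current_voice, standard_voice))
        return float(cur @ std) > self.threshold
```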
As shown in Fig. 5, Fig. 5 is a program module diagram of a second embodiment of the identity verification system of the present invention.
In the present embodiment, the identity verification system further includes:
A detection module 105, used to, before the current voice data and the standard voice data are framed separately according to the preset framing parameters, perform voice activity detection on the current voice data and the standard voice data respectively, and delete the non-speaker speech in the current voice data and the standard voice data.
Both the collected current voice data and the pre-stored standard voice data include some non-speaker speech portions (for example, silence or noise). If these portions are not deleted, the speech frame groups obtained after framing the current voice data or the standard voice data will contain speech frames that include non-speaker speech portions (some speech frames may even consist entirely of non-speaker speech). The preset-type acoustic features extracted from such speech frames with the preset filter are impurity features and reduce the accuracy of the results of the preset-structure deep neural network model. Therefore, before framing the voice data, this embodiment first detects the non-speaker speech portions in the current voice data and the standard voice data and deletes them. The detection method used in this embodiment for non-speaker speech portions is voice activity detection (VAD).
In the present embodiment, the training process of the preset-structure deep neural network model is (refer to Fig. 2):
S1, preset quantity voice data sample is obtained, each voice data sample is marked respectively and represents corresponding speak The label of personal part;
Preset quantity (for example, 10000) voice data sample is first got out, each voice data sample is all known theory The voice data of the personal part of words;In these voice data samples, each speaker's identity or partial speaker's identity correspond to There are multiple voice data samples, the label of corresponding speaker's identity will be represented on each voice data sample mark.
S2, active endpoint detection is carried out to each voice data sample respectively, by non-talking people in voice data sample Voice is deleted, and the standard voice data sample of preset quantity is obtained;
Active endpoint detection is carried out to voice data sample, to detect the non-talking people's in each voice data sample Voice (for example, mute or noise) is simultaneously deleted, and avoids existing in voice data sample special with the vocal print of corresponding speaker's identity Unrelated voice data is levied, and influences the training effect to model.
S3, using the first percentage of obtained standard voice data sample as training set, the second percentage is as verification Collection, first percentage and the second percentage and less than or equal to 100%;
For example, being used as training set by the 70% of obtained standard voice data sample, 30% as verification collection.
S4, each standard voice data sample that the training set and verification are concentrated is distinguished according to preset framing parameter Sub-frame processing is carried out, to obtain the corresponding speech frame group of each standard voice data sample, Predetermined filter is recycled to carry respectively Take out the preset kind acoustic feature of each speech frame in each speech frame group;
Wherein, for preset framing parameter for example, every 25 milliseconds of framings, frame moves 10 milliseconds;The Predetermined filter is, for example, Meier filter is MFCC (Mel Frequency by the preset kind acoustic feature that Meier filter extracts Cepstrum Coefficient, mel-frequency cepstrum coefficient) spectrum signature, for example, 36 dimension MFCC spectrum signatures.
S5, the corresponding preset kind acoustic feature of each speech frame group in the training set is divided into M crowd, it is in batches defeated Enter and be iterated training in the preset structure deep neural network model, and in the preset structure deep neural network model After the completion of training, verified using the accuracy rate of preset structure deep neural network model described in verification set pair;
Batch processing is carried out to the preset kind acoustic feature in training set, is divided into M (such as 30) batches, batch mode can It is allocation unit according to speech frame group, the corresponding preset kind acoustics of the speech frame group of distribution equivalent or inequality is special in every a batch Sign;The corresponding preset kind acoustic feature of each speech frame group in training set is preset according to the input of the batch being divided into one by one It is iterated training in constructional depth neural network model, the preset structure victory is made to read god per a batch preset kind acoustic feature Primary through network model iteration, each iteration can all update to obtain new model parameter, should after the completion of being trained by successive ignition Preset structure deep neural network model has been updated to preferable model parameter;After the completion of repetitive exercise, then verification collection is utilized The accuracy rate of the preset structure deep neural network model is verified, i.e., is divided the standard voice data that verification is concentrated two-by-two Group, each corresponding preset kind acoustic feature of standard voice data sample inputted in a grouping to the preset structure depth Neural network model confirms whether the verification structure of output is correct according to the identity label of the two of input standard voice datas, After completing to the verification of each grouping, accuracy rate is calculated according to the correct number of verification result, such as test 100 groupings Card, finally obtaining verification result correctly has 99 groups, then accuracy rate is just 99%.
S6: If the accuracy obtained by verification is greater than a preset threshold, model training ends.
A verification threshold for the accuracy (i.e., the preset threshold, for example 98.5%) is set in the system in advance to test the training effect of the preset-structure deep neural network model. If the accuracy obtained by verifying the preset-structure deep neural network model with the validation set is greater than the preset threshold, the training of the preset-structure deep neural network model has reached the required standard, and model training ends at this point.
S7: If the accuracy obtained by verification is less than or equal to the preset threshold, increase the number of voice data samples acquired, and re-execute the above steps S1-S5 based on the increased voice data samples.
If the accuracy obtained by verifying the preset-structure deep neural network model with the validation set is less than or equal to the preset threshold, the training of the preset-structure deep neural network model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In that case, the number of voice data samples acquired is increased (for example, by a fixed quantity or a random quantity each time), and on this basis the above steps S1-S5 are re-executed. This cycle repeats until the requirement of step S6 is met, at which point model training ends.
In this embodiment, the iterative training process of the preset-structure deep neural network model includes:
converting, according to the current parameters of the model, the preset-type acoustic features corresponding to each currently input speech frame group into a corresponding feature vector of a preset length;
randomly sampling the feature vectors to obtain multiple triplets, where the i-th triplet (x_i1, x_i2, x_i3) consists of three different feature vectors x_i1, x_i2 and x_i3, x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
calculating, using a predetermined calculation formula, the cosine similarity cosθ_i^+ between x_i1 and x_i2 and the cosine similarity cosθ_i^- between x_i1 and x_i3;
updating the parameters of the model according to the cosine similarities cosθ_i^+ and cosθ_i^- and a predetermined loss function L, where the formula of the predetermined loss function L is L = Σ_{i=1}^{N} max(0, cosθ_i^- − cosθ_i^+ + α), α is a constant with a value in the range 0.05~0.2, and N is the number of triplets obtained.
The model parameter update steps are: (1) compute the gradients of the preset-structure deep neural network using the back-propagation algorithm; (2) update the parameters of the preset-structure deep neural network using mini-batch SGD (mini-batch stochastic gradient descent).
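Under the loss formula reconstructed above, one training step combining back-propagation and mini-batch SGD might look as follows in PyTorch (a sketch; the model, learning rate, and margin value are illustrative assumptions):

import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, positive, negative, alpha=0.1):
    """L = sum_i max(0, cos_i^- - cos_i^+ + alpha), alpha in [0.05, 0.2]."""
    cos_pos = F.cosine_similarity(anchor, positive)   # same speaker
    cos_neg = F.cosine_similarity(anchor, negative)   # different speakers
    return torch.clamp(cos_neg - cos_pos + alpha, min=0).sum()

def train_step(model, optimizer, feats_a, feats_p, feats_n, alpha=0.1):
    """One mini-batch update: backpropagate the triplet loss, then step SGD."""
    optimizer.zero_grad()
    loss = triplet_cosine_loss(model(feats_a), model(feats_p),
                               model(feats_n), alpha)
    loss.backward()      # gradients via back-propagation
    optimizer.step()     # mini-batch SGD parameter update
    return loss.item()

Here `optimizer` would be created as, for example, `torch.optim.SGD(model.parameters(), lr=0.01)`, matching the mini-batch SGD update described above.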
Further, the network structure of the preset-structure deep neural network model of this embodiment is as follows:
First layer: a stack of several neural network layers of identical structure, where each neural network layer uses, side by side, a forward long short-term memory network (LSTM) and a backward LSTM; the number of stacked layers is 1 to 3; the forward LSTM and the backward LSTM each output a vector sequence.
Second layer: an averaging layer, whose role is to average the vector sequences along the time axis; it averages all the vector sequences output by the forward LSTMs and backward LSTMs of the previous layer to obtain one forward average vector and one backward average vector, and concatenates these two average vectors into a single vector.
Third layer: a fully connected layer of a deep neural network (DNN).
Fourth layer: a normalization layer, which normalizes the input from the previous layer according to the L2 norm to obtain a normalized feature vector of length 1.
Fifth layer: a loss layer, where the formula of the loss function L is L = Σ_{i=1}^{N} max(0, cosθ_i^- − cosθ_i^+ + α), α is a constant with a value in the range 0.05~0.2, cosθ_i^+ denotes the cosine similarity between two feature vectors belonging to the same speaker, and cosθ_i^- denotes the cosine similarity between two feature vectors not belonging to the same speaker.
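To make the layer description concrete, a sketch of this structure in PyTorch follows; the hidden size, embedding dimension, and number of stacked layers are illustrative assumptions within the ranges stated above, and the loss layer (applied only during training) is the triplet loss sketched earlier.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbeddingNet(nn.Module):
    """Sketch of the described structure: stacked forward/backward LSTMs,
    time averaging per direction, a fully connected DNN layer, and L2
    normalization to unit-length feature vectors."""

    def __init__(self, feat_dim=36, hidden=128, embed_dim=64, num_layers=2):
        super().__init__()
        # First layer: 1-3 stacked layers, each a forward and a backward LSTM.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Third layer: fully connected DNN layer.
        self.fc = nn.Linear(2 * hidden, embed_dim)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)             # (batch, frames, 2 * hidden)
        fwd, bwd = out.chunk(2, dim=-1)   # forward / backward vector sequences
        # Second layer: average each direction over time, then concatenate.
        pooled = torch.cat([fwd.mean(dim=1), bwd.mean(dim=1)], dim=-1)
        # Fourth layer: L2 normalization to a length-1 feature vector.
        return F.normalize(self.fc(pooled), p=2, dim=-1)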
Further, the present invention also proposes a computer-readable storage medium storing an identity verification system, where the identity verification system can be executed by at least one processor to cause the at least one processor to perform the identity verification method in any of the above embodiments.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the scope of the invention; any equivalent structural transformation made using the description and drawings of the present invention under its inventive concept, whether applied directly or indirectly in other related technical fields, falls within the scope of patent protection of the present invention.

Claims (10)

1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing an identity verification system executable on the processor, and the following steps being implemented when the identity verification system is executed by the processor:
after receiving current voice data of a target user whose identity is to be verified, obtaining standard voice data corresponding to the identity to be verified from a database, and performing framing on the current voice data and the standard voice data respectively according to preset framing parameters to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data;
extracting, using a preset filter, the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group;
inputting the extracted preset-type acoustic features corresponding to the current speech frame group and the extracted preset-type acoustic features corresponding to the standard speech frame group respectively into a pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data;
calculating the cosine similarity between the two obtained feature vectors, and determining an identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
2. The electronic device according to claim 1, characterized in that before the step of performing framing on the current voice data and the standard voice data respectively according to the preset framing parameters, the processor is further configured to execute the identity verification system to implement the following step:
performing voice activity detection on the current voice data and the standard voice data respectively, and deleting the non-speech audio in the current voice data and the standard voice data.
3. The electronic device according to claim 1, characterized in that the training process of the preset-structure deep neural network model is:
S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the identity of the corresponding speaker;
S2: performing voice activity detection on each voice data sample respectively, and deleting the non-speech audio in the voice data samples to obtain the preset number of standard voice data samples;
S3: using a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
S4: performing framing on each standard voice data sample in the training set and the validation set according to the preset framing parameters to obtain the speech frame group corresponding to each standard voice data sample, then using the preset filter to extract the preset-type acoustic features of each speech frame in each speech frame group;
S5: dividing the preset-type acoustic features corresponding to each speech frame group in the training set into M batches, inputting them batch by batch into the preset-structure deep neural network model for iterative training, and after the training of the preset-structure deep neural network model is completed, verifying the accuracy of the preset-structure deep neural network model using the validation set;
S6: if the accuracy obtained by verification is greater than a preset threshold, ending model training;
S7: if the accuracy obtained by verification is less than or equal to the preset threshold, increasing the number of voice data samples acquired, and re-executing the above steps S1-S5 based on the increased voice data samples.
4. The electronic device according to claim 3, characterized in that the iterative training process of the preset-structure deep neural network model includes:
converting, according to the current parameters of the model, the preset-type acoustic features corresponding to each currently input speech frame group into a corresponding feature vector of a preset length;
randomly sampling the feature vectors to obtain multiple triplets, where the i-th triplet (x_i1, x_i2, x_i3) consists of three different feature vectors x_i1, x_i2 and x_i3, x_i1 and x_i2 correspond to the same speaker, x_i1 and x_i3 correspond to different speakers, and i is a positive integer;
calculating, using a predetermined calculation formula, the cosine similarity cosθ_i^+ between x_i1 and x_i2 and the cosine similarity cosθ_i^- between x_i1 and x_i3;
updating the parameters of the model according to the cosine similarities cosθ_i^+ and cosθ_i^- and a predetermined loss function L, where the formula of the predetermined loss function L is L = Σ_{i=1}^{N} max(0, cosθ_i^- − cosθ_i^+ + α), α is a constant with a value in the range 0.05~0.2, and N is the number of triplets obtained.
5. The electronic device according to any one of claims 1-4, characterized in that the network structure of the preset-structure deep neural network model is as follows:
First layer: a stack of several neural network layers of identical structure, where each neural network layer uses, side by side, a forward long short-term memory network (LSTM) and a backward LSTM; the number of stacked layers is 1 to 3;
Second layer: an averaging layer, whose role is to average the vector sequences along the time axis; it averages all the vector sequences output by the forward LSTMs and backward LSTMs of the previous layer to obtain one forward average vector and one backward average vector, and concatenates these two average vectors into a single vector;
Third layer: a fully connected layer of a deep neural network (DNN);
Fourth layer: a normalization layer, which normalizes the input from the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is L = Σ_{i=1}^{N} max(0, cosθ_i^- − cosθ_i^+ + α), α is a constant with a value in the range 0.05~0.2, cosθ_i^+ denotes the cosine similarity between two feature vectors belonging to the same speaker, and cosθ_i^- denotes the cosine similarity between two feature vectors not belonging to the same speaker.
6. An identity verification method, characterized in that the identity verification method comprises:
after receiving current voice data of a target user whose identity is to be verified, obtaining standard voice data corresponding to the identity to be verified from a database, and performing framing on the current voice data and the standard voice data respectively according to preset framing parameters to obtain a current speech frame group corresponding to the current voice data and a standard speech frame group corresponding to the standard voice data;
extracting, using a preset filter, the preset-type acoustic features of each speech frame in the current speech frame group and the preset-type acoustic features of each speech frame in the standard speech frame group;
inputting the extracted preset-type acoustic features corresponding to the current speech frame group and the extracted preset-type acoustic features corresponding to the standard speech frame group respectively into a pre-trained preset-structure deep neural network model, to obtain feature vectors of a preset length corresponding to the current voice data and the standard voice data;
calculating the cosine similarity between the two obtained feature vectors, and determining an identity verification result according to the magnitude of the calculated cosine similarity, the identity verification result including a verification-passed result and a verification-failed result.
7. The identity verification method according to claim 6, characterized in that before the step of performing framing on the current voice data and the standard voice data respectively according to the preset framing parameters, the identity verification method further comprises the step of:
performing voice activity detection on the current voice data and the standard voice data respectively, and deleting the non-speech audio in the current voice data and the standard voice data.
8. The identity verification method according to claim 6, characterized in that the training process of the preset-structure deep neural network model is:
S1: acquiring a preset number of voice data samples, and labeling each voice data sample with a label representing the identity of the corresponding speaker;
S2: performing voice activity detection on each voice data sample respectively, and deleting the non-speech audio in the voice data samples to obtain the preset number of standard voice data samples;
S3: using a first percentage of the obtained standard voice data samples as a training set and a second percentage as a validation set, the sum of the first percentage and the second percentage being less than or equal to 100%;
S4: performing framing on each standard voice data sample in the training set and the validation set according to the preset framing parameters to obtain the speech frame group corresponding to each standard voice data sample, then using the preset filter to extract the preset-type acoustic features of each speech frame in each speech frame group;
S5: dividing the preset-type acoustic features corresponding to each speech frame group in the training set into M batches, inputting them batch by batch into the preset-structure deep neural network model for iterative training, and after the training of the preset-structure deep neural network model is completed, verifying the accuracy of the preset-structure deep neural network model using the validation set;
S6: if the accuracy obtained by verification is greater than a preset threshold, ending model training;
S7: if the accuracy obtained by verification is less than or equal to the preset threshold, increasing the number of voice data samples acquired, and re-executing the above steps S1-S5 based on the increased voice data samples.
9. The identity verification method according to any one of claims 6 to 8, characterized in that the network structure of the preset-structure deep neural network model is as follows:
First layer: a stack of several neural network layers of identical structure, where each neural network layer uses, side by side, a forward long short-term memory network (LSTM) and a backward LSTM; the number of stacked layers is 1 to 3;
Second layer: an averaging layer, whose role is to average the vector sequences along the time axis; it averages all the vector sequences output by the forward LSTMs and backward LSTMs of the previous layer to obtain one forward average vector and one backward average vector, and concatenates these two average vectors into a single vector;
Third layer: a fully connected layer of a deep neural network (DNN);
Fourth layer: a normalization layer, which normalizes the input from the previous layer according to the L2 norm to obtain a normalized feature vector of length 1;
Fifth layer: a loss layer, where the formula of the loss function L is L = Σ_{i=1}^{N} max(0, cosθ_i^- − cosθ_i^+ + α), α is a constant with a value in the range 0.05~0.2, cosθ_i^+ denotes the cosine similarity between two feature vectors belonging to the same speaker, and cosθ_i^- denotes the cosine similarity between two feature vectors not belonging to the same speaker.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an identity verification system, the identity verification system being executable by at least one processor to cause the at least one processor to perform the identity verification method according to any one of claims 6-9.