CN110010133A - Voiceprint detection method, device, equipment and storage medium based on short text - Google Patents
Voiceprint detection method, device, equipment and storage medium based on short text
- Publication number
- CN110010133A CN110010133A CN201910167882.3A CN201910167882A CN110010133A CN 110010133 A CN110010133 A CN 110010133A CN 201910167882 A CN201910167882 A CN 201910167882A CN 110010133 A CN110010133 A CN 110010133A
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- voice signal
- vector
- neural network
- deep neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Training, enrolment or model building
- G10L17/18 — Artificial neural networks; connectionist approaches
- G10L25/18 — Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
- G10L25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
Abstract
The invention discloses a voiceprint detection method, device, equipment and storage medium based on short text. The method comprises: training a preset deep neural network with training samples; obtaining a voice signal to be identified; preprocessing the voice signal to be identified and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients; feeding the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal; comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison; wherein the training samples and the voice signal are short text. The invention solves the problems of existing voiceprint detection methods: lengthy voice signals, a large amount of sample information, and a high demand on computing resources.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a voiceprint detection method, device, equipment and storage medium based on short text.
Background technique
Voiceprint detection is a common and effective method of identity verification. It can be applied to a range of scenarios that require authentication, such as network payment, voiceprint-based lock control, liveness authentication and IoT device verification, and it is especially useful for remote verification where video-based verification is inconvenient, since it is not limited by the device at all. During verification, performing a dual check with both content and voiceprint detection greatly raises the threshold for attacks and improves security. Currently used voiceprint detection methods include, but are not limited to, template matching, probabilistic models, artificial neural networks and the i-vector model. However, constrained by the structure of the models themselves, these methods find it difficult to complete text-dependent training with short text, so longer text carrying more features is generally used as the model input vector. Such voice signals are lengthy and carry many features, so the samples needed for training contain a large amount of information and occupy considerable computing resources.
Summary of the invention
Embodiments of the invention provide a voiceprint detection method, device, equipment and storage medium based on short text, to solve the problems of lengthy voice signals, large amounts of sample information and high computing-resource demands in existing voiceprint detection methods.
A voiceprint detection method based on short text, comprising:
obtaining training samples, and training a preset deep neural network with the training samples;
obtaining a voice signal to be identified;
preprocessing the voice signal to be identified, and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients;
feeding the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal, wherein each element of the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison;
wherein the training samples and the voice signal are short text.
Optionally, obtaining training samples and training the preset deep neural network with the training samples comprises:
obtaining the speech samples of multiple users as training samples;
preprocessing each user's training samples, and performing feature extraction on the preprocessed training samples to obtain Mel-frequency cepstral coefficients;
labelling each user's Mel-frequency cepstral coefficients with a user tag;
feeding the Mel-frequency cepstral coefficients with user tags as input vectors into the preset deep neural network for training;
using a preset loss function to compute the error between the deep neural network's recognition result for each Mel-frequency cepstral coefficient and the corresponding user tag, and modifying the parameters of the deep neural network according to the error;
feeding the Mel-frequency cepstral coefficients with user tags as input vectors into the parameter-modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition results for the Mel-frequency cepstral coefficients reaches a specified threshold, then stopping the iteration.
Optionally, the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has 12 nodes and uses the maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
Optionally, comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in the voiceprint model library and outputting a voiceprint detection result according to the comparison comprises:
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
if a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
if no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt that detection has failed.
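The comparison step above can be sketched as follows. The claim speaks of an "identical" pre-stored vector; since voiceprint vectors are real-valued embeddings, a practical implementation would typically relax this to a similarity threshold. The cosine-similarity measure and the threshold value below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def match_voiceprint(query, model_library, threshold=0.85):
    """Compare a query voiceprint vector against a library of
    pre-stored vectors (user_id -> vector).

    Returns the best-matching user_id, or None (detection failure)
    if no stored vector is similar enough. Cosine similarity and
    the 0.85 threshold are assumed conventions, not patent values.
    """
    best_user, best_score = None, -1.0
    for user_id, stored in model_library.items():
        # cosine similarity between the query and the stored vector
        score = np.dot(query, stored) / (
            np.linalg.norm(query) * np.linalg.norm(stored))
        if score > best_score:
            best_user, best_score = user_id, score
    return best_user if best_score >= threshold else None
```

A library lookup then either yields the matched user's information or reports a detection failure, mirroring the two output branches above.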
Optionally, preprocessing the voice signal to be identified and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients comprises:
performing framing on the waveform of the voice signal to be identified;
after framing, applying a window to each frame;
performing a discrete Fourier transform on each windowed frame to obtain the spectrum of that frame;
computing the power spectrum of the voice signal from the spectra of all frames;
applying a Mel filterbank to the power spectrum;
taking the logarithm of the output of each Mel filter to obtain log energies;
performing a discrete cosine transform on the log energies to obtain the Mel-frequency cepstral coefficients of the voice signal.
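A minimal sketch of the extraction chain above (framing, windowing, DFT power spectrum, Mel filterbank, log energy, DCT). The parameter choices — 25 ms frames, 10 ms hop, 26 filters, 13 coefficients, 512-point FFT — are common defaults assumed for illustration; the patent does not specify them:

```python
import numpy as np

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_filters=26, n_ceps=13):
    """Minimal MFCC extraction following the steps above.
    Assumes signal is at least one frame long."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_fft = 512
    # 1. framing: cut the waveform into fixed-length segments
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. windowing: smooth each frame with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 3. DFT of each frame -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. triangular Mel filterbank applied to the power spectrum
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. log energy of each filter's output
    feat = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II keeps the first n_ceps cepstral coefficients
    n = feat.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    2 * np.arange(n) + 1) / (2 * n))
    return feat @ basis.T
```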
A voiceprint detection device based on short text, comprising:
a training module, for obtaining training samples and training a preset deep neural network with the training samples;
a signal acquisition module, for obtaining a voice signal to be identified;
a feature extraction module, for preprocessing the voice signal to be identified and performing feature extraction on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients;
a feature obtaining module, for feeding the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal, wherein each element of the voiceprint vector represents a feature of the voice signal;
a detection module, for comparing the voiceprint vector of the voice signal with the voiceprint vectors pre-stored in a voiceprint model library and outputting a voiceprint detection result according to the comparison;
wherein the training samples and the voice signal are short text.
Optionally, the detection module comprises:
a comparing unit, for comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
a first result output unit, for obtaining and outputting the user information corresponding to a pre-stored voiceprint vector when a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library;
a second result output unit, for outputting a prompt that detection has failed when no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library.
Optionally, the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has 12 nodes and uses the maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the above voiceprint detection method based on short text when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the above voiceprint detection method based on short text when executed by a processor.
In embodiments of the invention, a deep neural network suitable for short text is designed in advance and trained with short-text training samples; a voice signal to be identified is obtained, preprocessed, and feature extraction is performed on the preprocessed voice signal to obtain Mel-frequency cepstral coefficients; the Mel-frequency cepstral coefficients are fed as input into the pre-trained deep neural network, and the output vector of its last fully connected layer is taken as the voiceprint vector of the voice signal, each element of which represents a feature of the voice signal; that voiceprint vector is compared with the voiceprint vectors pre-stored in a voiceprint model library, and a voiceprint detection result is output according to the comparison; wherein the training samples and the voice signal are both short text. Voiceprint detection based on short text is thereby realized, greatly reducing the model's input vector and solving the problems of lengthy voice signals, large amounts of sample information and high computing-resource demands in existing voiceprint detection methods.
Detailed description of the invention
In order to illustrate the technical solution of the embodiments of the present invention more clearly, below by institute in the description to the embodiment of the present invention
Attached drawing to be used is needed to be briefly described, it should be apparent that, the accompanying drawings in the following description is only some implementations of the invention
Example, for those of ordinary skill in the art, without any creative labor, can also be according to these attached drawings
Obtain other attached drawings.
Fig. 1 is a flowchart of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 2 is a flowchart of step S101 of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 3 is a flowchart of step S103 of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 4 is a flowchart of step S105 of the voiceprint detection method based on short text in an embodiment of the invention;
Fig. 5 is a functional block diagram of the voiceprint detection device based on short text in an embodiment of the invention;
Fig. 6 is a schematic diagram of a computer device in an embodiment of the invention.
Specific embodiment
Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete
Site preparation description, it is clear that described embodiments are some of the embodiments of the present invention, instead of all the embodiments.Based on this hair
Embodiment in bright, every other implementation obtained by those of ordinary skill in the art without making creative efforts
Example, shall fall within the protection scope of the present invention.
The voiceprint detection method based on short text provided by embodiments of the invention is applied to a server. The server may be implemented as an independent server or as a server cluster composed of multiple servers. In one embodiment, as shown in Fig. 1, a voiceprint detection method based on short text is provided, comprising the following steps:
In step S101, training samples are obtained, and a preset deep neural network is trained with the training samples.
Here, the embodiment of the invention redesigns a deep neural network suitable for short text. The deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has 12 nodes and uses the maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy. In this way the deep neural network is not limited by model structure and can take short text as training samples and input vectors, reducing the demand on data. Here, short text means a voice signal of short length, for example the voice signal of a single sentence. Optionally, short text may be defined by a specified length, e.g. a voice signal no longer than the specified length. The embodiment of the invention collects the speech samples of multiple users as training samples and trains the preset deep neural network on them. Optionally, as shown in Fig. 2, step S101 comprises:
In step S201, the speech samples of multiple users are obtained as training samples.
In this embodiment, the speech samples corresponding to multiple users may be collected in advance for the concrete application scenario; for example, each user's speech samples may be collected through channels such as specialized knowledge bases and network databases and used as training samples.
In step S202, each user's training samples are preprocessed, and feature extraction is performed on the preprocessed training samples to obtain MFCC features.
Here, MFCC (Mel-Frequency Cepstral Coefficients) features are a discriminative component of the voice signal: cepstral parameters extracted in the Mel-scale frequency domain. These parameters take into account the human ear's perception of different frequencies and are especially suitable for speech recognition and speaker identification. The embodiment of the invention designs the deep neural network around MFCC features and uses them as the input of the deep neural network. Before training the deep neural network, the user samples are first preprocessed and feature-extracted to obtain the corresponding MFCC features. Preprocessing and feature extraction of the user's training samples are identical to step S103; refer to the description of step S103, which is not repeated here.
By performing feature extraction on the preprocessed training samples, the embodiment of the invention obtains one group of 128-dimensional MFCC features for each training sample. The 128-dimensional MFCC features serve as the input vector of the deep neural network.
In step S203, the MFCC features of each user are labelled with a user tag.
In embodiments of the invention, the user tag identifies the speaker to whom the MFCC features belong; different users' MFCC features receive different user tags. Before training the deep neural network, each user's 128-dimensional MFCC features must be labelled with the corresponding user tag. For ease of understanding, an example follows. Suppose there are three users: user 1, user 2 and user 3. Through step S203, user 1's MFCC features are tagged "01", user 2's MFCC features are tagged "02", and user 3's MFCC features are tagged "03". It should be understood that this is only an example and does not limit the invention; in other embodiments the user tag may take other forms.
In step S204, the MFCC features with user tags are fed as input vectors into the preset deep neural network for training.
During training, for each user, the 128-dimensional MFCC features with the same user tag are fed as one input vector into the preset deep neural network for training, yielding the recognition result for that user.
Here, the preset deep neural network comprises an input layer, four fully connected layers and an output layer. Each fully connected layer has 12 nodes and uses the maxout activation function. The output of hidden-layer node i is:
h_i(x) = max_{j in [1,k]} z_{i,j}, where z_{i,j} = x^T W_{..ij} + b_{i,j}
In the formula, b denotes the bias, and W denotes the three-dimensional parameter tensor of size d × m × k, where d is the number of input-layer nodes, m is the number of hidden-layer nodes, and k is the number of implicit hidden nodes attached to each hidden-layer node; the k implicit hidden nodes all have linear outputs.
Each node of the maxout activation function takes the maximum of the k implicit hidden nodes' output values. In embodiments of the invention, the number of nodes m of each fully connected layer is 12; each of the 12 nodes takes the maximum of the k implicit hidden node outputs produced by the maxout activation, and the 12 maxima are combined into the output vector of the fully connected layer. By using the maxout activation function, the fully connected layers of the deep neural network perform a nonlinear transformation.
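A minimal sketch of one such maxout fully connected layer, with W as the d × m × k parameter tensor and b the bias described above; the exact tensor layout is an assumption consistent with the description:

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout fully connected layer.

    x : input vector of size d
    W : parameter tensor of size (d, m, k) -- d inputs, m output
        nodes, k implicit linear hidden nodes per output node
    b : bias of size (m, k)

    Each of the m output nodes computes k linear responses
    z_{i,j} = x^T W_{..ij} + b_{i,j} and emits the maximum,
    giving a piecewise-linear nonlinearity.
    """
    z = np.einsum('d,dmk->mk', x, W) + b  # k linear outputs per node
    return z.max(axis=1)                  # max over the k pieces
```

With m = 12, as in the embodiment, the layer's output is the 12-dimensional vector of per-node maxima.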
Further, in embodiments of the invention, the deep neural network comprises four fully connected layers, denoted the first, second, third and fourth fully connected layers. During training, the MFCC features with user tags first pass through the first fully connected layer; the output vector of the first fully connected layer becomes the input vector of the second, the output vector of the second becomes the input vector of the third, the output vector of the third becomes the input vector of the fourth, and the output vector of the fourth fully connected layer becomes the input of the output layer. When training the third and fourth fully connected layers, the embodiment of the invention uses a dropout strategy. When the output vector of the second fully connected layer is passed into the third fully connected layer, elements are randomly dropped according to a preset first drop probability. It should be understood that dropping means these elements are "erased" from the network: in the current training pass, the erased elements do not participate. The maxout activation of the third fully connected layer is then applied to the remaining elements to produce the output vector of the third fully connected layer. Elements of that output vector are then randomly dropped according to a preset second drop probability, and the remaining elements are fed into the fourth fully connected layer for training. The first and second drop probabilities are set according to actual needs; in the embodiment of the invention both are preferably 0.5. Using the dropout strategy effectively weakens the co-adaptation between hidden-layer nodes and enhances generalization, preventing the deep neural network from over-fitting during training and improving the training effect.
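The random element-dropping described above can be sketched as follows. The "inverted" rescaling by 1/(1-p) is a common implementation convention assumed here so that no rescaling is needed at inference time; the patent itself only specifies the dropping:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Randomly 'erase' elements of x with drop probability p.

    During training, each element is zeroed with probability p so it
    does not participate in the current pass; surviving elements are
    scaled by 1/(1-p) (inverted dropout, an assumed convention).
    At inference time the input passes through unchanged.
    """
    if not training:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p   # True = element survives
    return x * mask / (1.0 - p)
```

In the embodiment's pipeline this would be applied to the second layer's output before the third layer and to the third layer's output before the fourth, with p = 0.5.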
In step S205, a preset loss function is used to compute the error between the recognition result of each MFCC feature through the deep neural network and the corresponding user tag, and the parameters of the deep neural network are modified according to the error.
After the four fully connected layers, the output vector of the fourth fully connected layer serves as the input of the output layer. The output layer is a softmax layer, which classifies according to the output vector of the fourth fully connected layer and produces the recognition result of the MFCC features, i.e. the user that the deep neural network predicts the MFCC features belong to. As noted above, each fully connected layer uses the maxout activation function, which has a three-dimensional parameter tensor W and a bias b. After step S204 completes training on each MFCC feature and obtains the corresponding recognition result, the preset loss function is used to compute the error between the recognition result of each MFCC feature and the corresponding user tag, and the error is propagated back to modify the parameter tensor W and bias b of the maxout activation functions in the deep neural network. Optionally, the loss function includes but is not limited to the cross-entropy loss function and the quadratic loss function.
In step S206, the MFCC features with user tags are fed as input vectors into the parameter-modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition results for the MFCC features reaches a specified threshold, at which point iteration stops.
After the parameters are modified in step S205, the next round of training feeds the tagged MFCC features as input vectors into the parameter-modified deep neural network again; the training process is identical to step S204, see the description above, and is not repeated here. Steps S204, S205 and S206 are iterated until the accuracy of the deep neural network's recognition results for all users' MFCC features reaches the specified threshold, i.e. the probability that the recognition result of each MFCC feature matches the corresponding user tag reaches the specified threshold. The parameters of the deep neural network are then considered adjusted into place, the deep neural network is deemed trained, and iteration stops.
The trained deep neural network can be used to extract voiceprint vectors from voice signals.
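The iterate-until-accuracy loop of steps S204-S206 can be sketched as below. A single softmax layer stands in for the four maxout layers so the loop stays readable; this is a schematic of the stopping rule (forward pass, loss gradient, parameter update, repeat until accuracy reaches the threshold), not the patent's actual network:

```python
import numpy as np

def train_until_accuracy(X, y, n_classes, threshold=0.99,
                         lr=0.5, max_iter=2000):
    """Repeat forward pass -> error -> parameter update until the
    recognition accuracy on the training features reaches the given
    threshold, then stop iterating (steps S204-S206 in miniature).
    X: (n_samples, n_features) feature matrix; y: integer labels."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(max_iter):
        # forward pass: softmax over class scores
        logits = X @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # stopping rule: accuracy threshold reached -> stop iterating
        if (probs.argmax(axis=1) == y).mean() >= threshold:
            break
        # cross-entropy gradient w.r.t. logits: probs - one_hot(y)
        grad = probs.copy()
        grad[np.arange(len(y)), y] -= 1.0
        W -= lr * X.T @ grad / len(y)
        b -= lr * grad.sum(axis=0) / len(y)
    return W, b
```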
In step S102, a voice signal to be identified is obtained.
The voice signal to be identified is short text, i.e. a voice signal of short length, such as the voice signal of a single sentence, which reduces the demand on data. In each identification pass, the acquired voice signal to be identified should belong to one user to be identified. The voice signal to be identified may be a single voice signal or multiple voice signals.
In step s 103, the voice signal to be identified is pre-processed, and to the pretreated voice
Signal carries out feature extraction, obtains MFCC feature.
Before the deep neural network is used, feature extraction is first performed on the voice signal to be identified to obtain the corresponding MFCC features. Optionally, as shown in Figure 3, step S103 includes:
In step S301, framing is performed on the waveform of the voice signal to be identified.
Here, framing refers to cutting the waveform of a voice signal of indefinite length into segments of fixed length, usually 10-30 milliseconds per frame. A voice signal changes rapidly, whereas the Fourier transform is suited to analyzing stationary signals. Framing the waveform of the voice signal reduces the intensity of the side lobes after the Fourier transform and improves the quality of the obtained spectrum.
In step S302, after framing, windowing is performed on each frame signal.
The embodiment of the present invention performs windowing on each frame signal to smooth the voice signal. Optionally, a Hamming window may be used for smoothing; compared with a rectangular window function, the Hamming window strengthens the continuity between the left and right ends of the voice signal and can effectively attenuate the side-lobe intensity and spectral leakage after the Fourier transform.
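Steps S301 and S302 can be sketched together. A 25 ms frame with a 10 ms hop is assumed here (within the patent's 10-30 ms range); the function name and parameters are illustrative:

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Steps S301-S302: cut the waveform into fixed-length frames
    (10-30 ms; 25 ms here) and apply a Hamming window to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # smooths the two ends of each frame
```

At a 16 kHz sampling rate this yields 400-sample frames advanced 160 samples at a time.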
In step S303, a discrete Fourier transform is performed on each windowed frame signal to obtain the spectrum corresponding to that frame.
Since the characteristics of a voice signal are difficult to discern from its variation in the time domain, the voice signal needs to be converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. After each frame of the voice signal has been windowed, a discrete Fourier transform is performed to obtain the energy distribution of that frame on the spectrum. Performing a discrete Fourier transform on each framed and windowed frame signal yields the spectrum of each frame, and hence the spectrum of the voice signal.
In step S304, the power spectrum of the voice signal is calculated from the spectra corresponding to all frame signals.
After the discrete Fourier transform is completed, the resulting energy distribution is a frequency-domain signal. The energy of each frequency band differs in magnitude, and the energy spectra of different phonemes also differ, so the squared modulus of the spectrum of the voice signal is taken to obtain the power spectrum of the voice signal.
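A minimal sketch of steps S303 and S304, assuming a one-sided DFT and the common 1/N normalization of the power spectrum (the patent specifies only the squared modulus):

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Steps S303-S304: one-sided discrete Fourier transform per frame,
    then the squared modulus of the spectrum (normalised by n_fft)."""
    spectrum = np.fft.rfft(frames, n=n_fft)    # frequency-domain signal per frame
    return (np.abs(spectrum) ** 2) / n_fft     # power spectrum
```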
In step S305, the Mel filter bank is computed from the power spectrum.
Here, the Mel filter bank is a group of non-linearly distributed filters: densely distributed in the low-frequency region and sparsely distributed in the high-frequency region, which better matches the characteristics of human hearing. The embodiment of the present invention applies a bank of n triangular filters to the voice signal, i.e., the power spectrum of the voice signal is multiplied by the bank of n triangular filters, converting the power spectrum of the voice signal into an n-dimensional vector. Here, the triangular filters can eliminate harmonics and highlight the formants of the original voice signal, thereby reducing the amount of data.
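The bank of n triangular filters of step S305 can be sketched with the widely used Mel-scale construction; the 2595·log10(1 + f/700) mapping is a common convention and an assumption here, since the patent does not specify the mapping:

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Step S305: n triangular filters spaced uniformly on the Mel scale,
    hence dense at low frequencies and sparse at high frequencies."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                    # rising edge
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                   # falling edge
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def apply_filter_bank(power_spec, fbank):
    """Multiply the power spectrum by the n triangular filters,
    converting each frame into an n-dimensional vector."""
    return power_spec @ fbank.T
```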
In step S306, a logarithm operation is performed on the output of each Mel filter to obtain the logarithmic energy.
Each element of the n-dimensional vector obtained in step S305 is the output of one Mel filter in the Mel filter bank. The embodiment of the present invention further takes the logarithm of each element of the obtained n-dimensional vector to obtain the logarithmic energies output by the Mel filter bank, i.e., the log-mel filter bank energies. These logarithmic energies are used in the subsequent cepstral analysis.
In step S307, a discrete cosine transform is performed on the logarithmic energy to obtain the MFCC features of the voice signal.
Having obtained the logarithmic energy of the voice signal in step S306, the embodiment of the present invention performs a discrete cosine transform on the logarithmic energy and takes the coefficients of the lowest 128 dimensions of the output result as the MFCC features of the voice signal. Here, the output result of the discrete cosine transform has a good energy-compaction property: the larger values concentrate in the low-frequency part near the upper-left corner, while the rest of the output consists largely of values that are zero or close to zero. The embodiment of the present invention takes the values of the lowest 128 dimensions of the output result as the MFCC features, so as to further compress the data.
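Steps S306 and S307 amount to a logarithm followed by a type-II DCT. In the sketch below the default of 13 kept coefficients is purely illustrative, to keep the example small; the patent keeps the lowest 128 dimensions:

```python
import numpy as np

def mfcc_from_filter_bank(fbank_energies, n_ceps=13):
    """Steps S306-S307: logarithm of each Mel-filter output, then a
    type-II DCT, keeping only the lowest coefficients."""
    log_energy = np.log(fbank_energies + 1e-10)   # log-mel filter bank energies
    n = log_energy.shape[-1]
    k = np.arange(n)
    j = np.arange(n_ceps)
    basis = np.cos(np.pi / n * (k[None, :] + 0.5) * j[:, None])  # DCT-II basis
    return log_energy @ basis.T                   # lowest n_ceps coefficients
```

The energy-compaction property described above is why only the lowest coefficients need to be kept.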
The MFCC features do not depend on the nature of the signal, impose no restrictions on the input signal, have high robustness, conform to the auditory characteristics of the human ear, and still deliver good recognition performance when the signal-to-noise ratio decreases. Using the MFCC features as the acoustic features of the voice signal to be identified and passing them into the deep neural network for identification can therefore improve the recognition accuracy of the deep neural network.
In step S104, the MFCC features are passed as input into the pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal.
After the MFCC features of the voice signal have been obtained, they are passed as input into the pre-trained deep neural network, and the deep neural network identifies the voice signal based on the MFCC features. Here, the pre-trained deep neural network includes four fully connected layers, each containing 12 nodes, and a 12-dimensional output vector is obtained through the maxout excitation function. Once the deep neural network has finished identifying the voice signal, the output vector of the neural network at the last fully connected layer is obtained as the d-vector of the voice signal. The d-vector is the voiceprint vector of the voice signal, and each element in it represents a voiceprint feature of the voice signal.
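The d-vector extraction of step S104 can be sketched as follows. The patent specifies only four fully connected layers of 12 nodes with the maxout excitation function; the number of maxout pieces, the input dimension and the weight layout are illustrative assumptions:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout excitation: several linear pieces, element-wise maximum."""
    return np.max(np.einsum('i,pij->pj', x, W) + b, axis=0)

def d_vector(mfcc, layers):
    """Step S104 sketch: pass the MFCC feature through four fully connected
    maxout layers of 12 nodes each; the output of the last fully connected
    layer is the d-vector, i.e. the voiceprint vector."""
    h = mfcc
    for W, b in layers:       # each W has shape (pieces, in_dim, 12)
        h = maxout(h, W, b)
    return h                  # 12-dimensional voiceprint vector
```

During enrollment and detection alike, this 12-dimensional output is what gets stored in, or compared against, the voiceprint model library.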
In step S105, the voiceprint vector of the voice signal is compared with the voiceprint vectors prestored in the voiceprint model library, and a voiceprint detection result is output according to the comparison result.
Here, the voiceprint model library is configured as needed for the identity-authentication application scenario, such as online payment, voiceprint lock control or proof-of-life authentication. The voiceprint model library holds multiple prestored voiceprint vectors together with their corresponding user information. In a specific application scenario, the deep neural network is first used in advance to identify the users to be authenticated, extract their voiceprint vectors, and enter them into the voiceprint model library.
When voiceprint detection is performed, the voiceprint vector of the voice signal to be identified is compared with the voiceprint vectors prestored in the voiceprint model library, so as to perform speaker discrimination on the voice signal. Optionally, as shown in Figure 4, step S105 includes:
In step S401, the voiceprint vector of the voice signal is compared with the voiceprint vectors prestored in the voiceprint model library.
Here, the embodiment of the present invention compares the voiceprint vector of the voice signal with each voiceprint vector prestored in the voiceprint model library, judging whether the elements of the two are identical.
In step S402, if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, the user information corresponding to that prestored voiceprint vector is obtained and output.
If a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, this shows that the speaker of the voice signal to be identified has already been entered into the voiceprint model library and that the voice signal belongs to a user authenticated in the voiceprint model library; the user information corresponding to the prestored voiceprint vector is obtained and output, completing the identification of the voice signal to be identified.
In step S403, if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, a prompt indicating detection failure is output.
If no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, this shows that the speaker of the voice signal to be identified has not been entered into the voiceprint model library and that the voice signal does not belong to any user authenticated in the voiceprint model library; a prompt indicating verification failure is then output.
The short-text-based voiceprint detection method described in the embodiment of the present invention can be applied to a series of application scenarios that require identity authentication, such as online payment, voiceprint lock control and proof-of-life authentication, and can also be used for Internet-of-Things device verification. It is especially suitable for remote verification where video-image verification is inconvenient: it is not limited by equipment at all, identity can be confirmed over the phone, and the cost of remote verification can thus be greatly reduced.
In conclusion the embodiment of the present invention is suitable for the deep neural network of short text by redesigning in advance, then
Preset deep neural network is trained using the training sample of short text;When carrying out vocal print detection, obtain to be identified
Voice signal, the voice signal be short text;The voice signal to be identified is pre-processed, and to pretreatment after
The voice signal carry out feature extraction, obtain MFCC feature;The MFCC feature is passed to as input and is trained in advance
Deep neural network, obtain the deep neural network in the output vector of the full articulamentum of the last layer, as the voice
The vocal print vector of signal, the feature of voice signal described in each element representation in the vocal print vector;By the voice signal
Vocal print vector be compared with the vocal print vector that prestores in sound-groove model library, and according to comparison result export vocal print detection knot
Fruit;To realize the vocal print detection based on short text, the input vector of model is greatly reduced, solves existing vocal print inspection
Voice signal is tediously long in survey method, amounts of specimen information is big, the demanding problem of calculation resources.
It should be understood that the sequence numbers of the steps in the above embodiment do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a short-text-based voiceprint detection device is provided, and this short-text-based voiceprint detection device corresponds one-to-one to the short-text-based voiceprint detection method in the above embodiment. As shown in Figure 5, the short-text-based voiceprint detection device includes a training module, a signal acquisition module, a feature extraction module, a feature obtaining module and a detection module. The functional modules are described in detail as follows:
Training module 51, for obtaining training samples and training a preset deep neural network with the training samples;
Signal acquisition module 52, for obtaining the voice signal to be identified;
Feature extraction module 53, for pre-processing the voice signal to be identified and performing feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients;
Feature obtaining module 54, for passing the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
Detection module 55, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
Optionally, the training module 51 includes:
A sample acquisition unit, for obtaining voice samples of multiple users as training samples;
A feature extraction unit, for pre-processing the training samples of each user and performing feature extraction on the pre-processed training samples to obtain Mel-frequency cepstral coefficients;
A tag unit, for attaching a user tag to the Mel-frequency cepstral coefficients of each user;
A training unit, for passing the Mel-frequency cepstral coefficients with user tags as input vectors into the preset deep neural network for training;
A parameter modification unit, for calculating, using a preset loss function, the error between the recognition result of each Mel-frequency cepstral coefficient through the deep neural network and the corresponding user tag, and modifying the parameters of the deep neural network according to the error;
The training unit is further used for passing the Mel-frequency cepstral coefficients with user tags as input vectors into the parameter-modified deep neural network for the next iteration of training, until the accuracy of the deep neural network's recognition results for each Mel-frequency cepstral coefficient reaches the specified threshold, at which point iteration stops.
Optionally, the deep neural network includes an input layer, four fully connected layers and an output layer; each fully connected layer has a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained using the dropout strategy.
Optionally, the feature extraction module 53 includes:
A framing unit, for performing framing on the waveform of the voice signal to be identified;
A windowing unit, for performing windowing on each frame signal after framing;
A transform unit, for performing a discrete Fourier transform on each windowed frame signal to obtain the spectrum corresponding to that frame;
A spectrum calculation unit, for calculating the power spectrum of the voice signal from the spectra corresponding to all frame signals;
A filter bank calculation unit, for computing the Mel filter bank from the power spectrum;
A logarithm unit, for performing a logarithm operation on the output of each Mel filter to obtain the logarithmic energy;
A cosine transform unit, for performing a discrete cosine transform on the logarithmic energy to obtain the Mel-frequency cepstral coefficients of the voice signal.
Optionally, the detection module 55 includes:
A comparison unit, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library;
A first result output unit, for, if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that prestored voiceprint vector and outputting the user information;
A second result output unit, for, if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt indicating detection failure.
For the specific limitations of the short-text-based voiceprint detection device, refer to the limitations of the short-text-based voiceprint detection method above, which are not repeated here. Each module in the above short-text-based voiceprint detection device may be realized in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, the processor in the computer equipment, or may be stored in software form in the memory of the computer equipment, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, computer equipment is provided. The computer equipment may be a server, and its internal structure may be as shown in Figure 6. The computer equipment includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer equipment provides computing and control capability. The memory of the computer equipment includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer equipment is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, realizes a short-text-based voiceprint detection method.
In one embodiment, computer equipment is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the following steps are performed:
Obtaining training samples, and training a preset deep neural network with the training samples;
Obtaining a voice signal to be identified;
Pre-processing the voice signal to be identified, and performing feature extraction on the pre-processed voice signal to obtain MFCC features;
Passing the MFCC features as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
Comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the following steps are performed:
Obtaining training samples, and training a preset deep neural network with the training samples;
Obtaining a voice signal to be identified;
Pre-processing the voice signal to be identified, and performing feature extraction on the pre-processed voice signal to obtain MFCC features;
Passing the MFCC features as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
Comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of each of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It is apparent to those skilled in the art that, for convenience and conciseness of description, only the division of the above functional units and modules is illustrated by example; in practical application, the above functions can be assigned to different functional units and modules as needed, i.e., the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are merely illustrative of the technical solutions of the present invention and are not limiting; although the present invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced with equivalents, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention; they should all be included within the protection scope of the present invention.
Claims (10)
1. A voiceprint detection method based on short text, characterized by comprising:
obtaining training samples, and training a preset deep neural network with the training samples;
obtaining a voice signal to be identified;
pre-processing the voice signal to be identified, and performing feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients;
passing the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
2. The voiceprint detection method based on short text according to claim 1, characterized in that the obtaining of training samples and the training of the preset deep neural network with the training samples comprise:
obtaining voice samples of multiple users as training samples;
pre-processing the training samples of each user, and performing feature extraction on the pre-processed training samples to obtain Mel-frequency cepstral coefficients;
attaching a user tag to the Mel-frequency cepstral coefficients of each user;
passing the Mel-frequency cepstral coefficients with user tags as input vectors into the preset deep neural network for training;
calculating, using a preset loss function, the error between the recognition result of each Mel-frequency cepstral coefficient through the deep neural network and the corresponding user tag, and modifying the parameters of the deep neural network according to the error;
passing the Mel-frequency cepstral coefficients with user tags as input vectors into the parameter-modified deep neural network for the next iteration of training, until the accuracy of the deep neural network's recognition results for each Mel-frequency cepstral coefficient reaches a specified threshold, at which point iteration stops.
3. The voiceprint detection method based on short text according to claim 2, characterized in that the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained using the dropout strategy.
4. The voiceprint detection method based on short text according to any one of claims 1 to 3, characterized in that the comparing of the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library and the outputting of a voiceprint detection result according to the comparison result comprise:
comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library;
if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that prestored voiceprint vector and outputting the user information;
if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt indicating detection failure.
5. The voiceprint detection method based on short text according to any one of claims 1 to 3, characterized in that the pre-processing of the voice signal to be identified and the feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients comprise:
performing framing on the waveform of the voice signal to be identified;
after framing, performing windowing on each frame signal;
performing a discrete Fourier transform on each windowed frame signal to obtain the spectrum corresponding to that frame;
calculating the power spectrum of the voice signal from the spectra corresponding to all frame signals;
computing the Mel filter bank from the power spectrum;
performing a logarithm operation on the output of each Mel filter to obtain the logarithmic energy;
performing a discrete cosine transform on the logarithmic energy to obtain the Mel-frequency cepstral coefficients of the voice signal.
6. A voiceprint detection device based on short text, characterized by comprising:
a training module, for obtaining training samples and training a preset deep neural network with the training samples;
a signal acquisition module, for obtaining a voice signal to be identified;
a feature extraction module, for pre-processing the voice signal to be identified and performing feature extraction on the pre-processed voice signal to obtain Mel-frequency cepstral coefficients;
a feature obtaining module, for passing the Mel-frequency cepstral coefficients as input into the pre-trained deep neural network and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal;
a detection module, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the voice signal are short text.
7. The voiceprint detection device based on short text according to claim 6, characterized in that the detection module comprises:
a comparison unit, for comparing the voiceprint vector of the voice signal with the voiceprint vectors prestored in the voiceprint model library;
a first result output unit, for, if a prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, obtaining the user information corresponding to that prestored voiceprint vector and outputting the user information;
a second result output unit, for, if no prestored voiceprint vector identical to the voiceprint vector of the voice signal exists in the voiceprint model library, outputting a prompt indicating detection failure.
8. The voiceprint detection device based on short text according to claim 6 or 7, characterized in that the deep neural network comprises an input layer, four fully connected layers and an output layer; each fully connected layer has a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained using the dropout strategy.
9. Computer equipment, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that, when the processor executes the computer program, the voiceprint detection method based on short text according to any one of claims 1 to 5 is realized.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the voiceprint detection method based on short text according to any one of claims 1 to 5 is realized.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910167882.3A CN110010133A (en) | 2019-03-06 | 2019-03-06 | Vocal print detection method, device, equipment and storage medium based on short text |
PCT/CN2019/117731 WO2020177380A1 (en) | 2019-03-06 | 2019-11-13 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110010133A true CN110010133A (en) | 2019-07-12 |
Family
ID=67166562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910167882.3A Pending CN110010133A (en) | 2019-03-06 | 2019-03-06 | Vocal print detection method, device, equipment and storage medium based on short text |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110010133A (en) |
WO (1) | WO2020177380A1 (en) |
WO2020177380A1 (en) * | 2019-03-06 | 2020-09-10 | 平安科技(深圳)有限公司 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
CN112071322A (en) * | 2020-10-30 | 2020-12-11 | 北京快鱼电子股份公司 | End-to-end voiceprint recognition method, device, storage medium and equipment |
CN112185347A (en) * | 2020-09-27 | 2021-01-05 | 北京达佳互联信息技术有限公司 | Language identification method, language identification device, server and storage medium |
CN112242137A (en) * | 2020-10-15 | 2021-01-19 | 上海依图网络科技有限公司 | Training of human voice separation model and human voice separation method and device |
CN112259114A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voice processing method and device, computer storage medium and electronic equipment |
WO2021051608A1 (en) * | 2019-09-20 | 2021-03-25 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device employing deep learning, and apparatus |
CN112562656A (en) * | 2020-12-16 | 2021-03-26 | 咪咕文化科技有限公司 | Signal classification method, device, equipment and storage medium |
CN112562691A (en) * | 2020-11-27 | 2021-03-26 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
CN112802481A (en) * | 2021-04-06 | 2021-05-14 | 北京远鉴信息技术有限公司 | Voiceprint verification method, voiceprint recognition model training method, device and equipment |
WO2021115176A1 (en) * | 2019-12-09 | 2021-06-17 | 华为技术有限公司 | Speech recognition method and related device |
CN113223536A (en) * | 2020-01-19 | 2021-08-06 | Tcl集团股份有限公司 | Voiceprint recognition method and device and terminal equipment |
CN113407768A (en) * | 2021-06-24 | 2021-09-17 | 深圳市声扬科技有限公司 | Voiceprint retrieval method, device, system, server and storage medium |
CN113470653A (en) * | 2020-03-31 | 2021-10-01 | 华为技术有限公司 | Voiceprint recognition method, electronic equipment and system |
CN114003885A (en) * | 2021-11-01 | 2022-02-01 | 浙江大学 | Intelligent voice authentication method, system and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223568A (en) * | 2022-06-29 | 2022-10-21 | 厦门快商通科技股份有限公司 | Identity verification method, device and system based on voiceprint recognition and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10540979B2 (en) * | 2014-04-17 | 2020-01-21 | Qualcomm Incorporated | User interface for secure access to a device using speaker verification |
CN105788592A (en) * | 2016-04-28 | 2016-07-20 | 乐视控股(北京)有限公司 | Audio classification method and apparatus thereof |
CN107808664B (en) * | 2016-08-30 | 2021-07-30 | 富士通株式会社 | Sparse neural network-based voice recognition method, voice recognition device and electronic equipment |
CN107610707B (en) * | 2016-12-15 | 2018-08-31 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device |
CN107527620B (en) * | 2017-07-25 | 2019-03-26 | 平安科技(深圳)有限公司 | Electronic device, identity authentication method and computer-readable storage medium |
CN110010133A (en) * | 2019-03-06 | 2019-07-12 | 平安科技(深圳)有限公司 | Voiceprint detection method, device, equipment and storage medium based on short text |
2019
- 2019-03-06: CN application CN201910167882.3A filed (published as CN110010133A); status: Pending
- 2019-11-13: WO application PCT/CN2019/117731 filed (published as WO2020177380A1); status: Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160372121A1 (en) * | 2015-06-17 | 2016-12-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voiceprint authentication method and apparatus |
CN105869644A (en) * | 2016-05-25 | 2016-08-17 | 百度在线网络技术(北京)有限公司 | Deep learning based voiceprint authentication method and device |
CN108369813A (en) * | 2017-07-31 | 2018-08-03 | 深圳和而泰智能家居科技有限公司 | Specific sound recognition method, equipment and storage medium |
CN108417217A (en) * | 2018-01-11 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Speaker identification network model training method, speaker identification method and system |
CN108877812A (en) * | 2018-08-16 | 2018-11-23 | 桂林电子科技大学 | Voiceprint recognition method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
EHSAN VARIANI et al.: "Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification", IEEE Xplore * |
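The sole non-patent citation is the d-vector paper, in which activations of the last hidden layer of a text-dependent DNN are averaged over all frames of an utterance and the normalized average serves as the speaker's "d-vector". A minimal sketch of that averaging-and-normalizing step follows; a random affine layer with ReLU stands in for the trained network, and all sizes and weights are hypothetical:

```python
import math
import random

def hidden_activations(frame, weights):
    # Stand-in for the last hidden layer of a trained DNN:
    # one affine projection per row, followed by ReLU.
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in weights]

def d_vector(frames, weights):
    # Average frame-level hidden activations over the utterance,
    # then length-normalize, as in the d-vector scheme.
    dim = len(weights)
    acc = [0.0] * dim
    for frame in frames:
        h = hidden_activations(frame, weights)
        for i in range(dim):
            acc[i] += h[i]
    acc = [a / len(frames) for a in acc]
    norm = math.sqrt(sum(a * a for a in acc)) or 1.0
    return [a / norm for a in acc]

# Hypothetical 4-dim input features, 6-dim hidden layer, 30-frame utterance.
random.seed(1)
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
utterance = [[random.gauss(0, 1) for _ in range(4)] for _ in range(30)]
dv = d_vector(utterance, weights)
print(len(dv))  # 6
```

At enrollment and test time, two such d-vectors would be compared (typically by cosine similarity) to accept or reject the speaker.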
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020177380A1 (en) * | 2019-03-06 | 2020-09-10 | 平安科技(深圳)有限公司 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
CN110751944A (en) * | 2019-09-19 | 2020-02-04 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for constructing voice recognition model |
WO2021051608A1 (en) * | 2019-09-20 | 2021-03-25 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device employing deep learning, and apparatus |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
CN110880327A (en) * | 2019-10-29 | 2020-03-13 | 平安科技(深圳)有限公司 | Audio signal processing method and device |
CN110875043A (en) * | 2019-11-11 | 2020-03-10 | 广州国音智能科技有限公司 | Voiceprint recognition method and device, mobile terminal and computer readable storage medium |
CN110875043B (en) * | 2019-11-11 | 2022-06-17 | 广州国音智能科技有限公司 | Voiceprint recognition method and device, mobile terminal and computer readable storage medium |
CN111128234A (en) * | 2019-12-05 | 2020-05-08 | 厦门快商通科技股份有限公司 | Spliced voice recognition detection method, device and equipment |
WO2021115176A1 (en) * | 2019-12-09 | 2021-06-17 | 华为技术有限公司 | Speech recognition method and related device |
CN111462757A (en) * | 2020-01-15 | 2020-07-28 | 北京远鉴信息技术有限公司 | Data processing method and device based on voice signal, terminal and storage medium |
CN111462757B (en) * | 2020-01-15 | 2024-02-23 | 北京远鉴信息技术有限公司 | Voice signal-based data processing method, device, terminal and storage medium |
CN111227839B (en) * | 2020-01-19 | 2023-08-18 | 中国电子科技集团公司电子科学研究院 | Behavior recognition method and device |
CN113223536B (en) * | 2020-01-19 | 2024-04-19 | Tcl科技集团股份有限公司 | Voiceprint recognition method and device and terminal equipment |
CN113223536A (en) * | 2020-01-19 | 2021-08-06 | Tcl集团股份有限公司 | Voiceprint recognition method and device and terminal equipment |
CN111227839A (en) * | 2020-01-19 | 2020-06-05 | 中国电子科技集团公司电子科学研究院 | Behavior identification method and device |
CN111326161B (en) * | 2020-02-26 | 2023-06-30 | 北京声智科技有限公司 | Voiceprint determining method and device |
CN111326161A (en) * | 2020-02-26 | 2020-06-23 | 北京声智科技有限公司 | Voiceprint determination method and device |
CN111341320A (en) * | 2020-02-28 | 2020-06-26 | 中国工商银行股份有限公司 | Phrase voice voiceprint recognition method and device |
CN111341320B (en) * | 2020-02-28 | 2023-04-14 | 中国工商银行股份有限公司 | Phrase voice voiceprint recognition method and device |
CN111341307A (en) * | 2020-03-13 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN111582020A (en) * | 2020-03-25 | 2020-08-25 | 平安科技(深圳)有限公司 | Signal processing method, signal processing device, computer equipment and storage medium |
CN113470653A (en) * | 2020-03-31 | 2021-10-01 | 华为技术有限公司 | Voiceprint recognition method, electronic equipment and system |
CN111583935A (en) * | 2020-04-02 | 2020-08-25 | 深圳壹账通智能科技有限公司 | Loan intelligent delivery method, device and storage medium |
CN111326163A (en) * | 2020-04-15 | 2020-06-23 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, device and equipment |
CN111524522B (en) * | 2020-04-23 | 2023-04-07 | 上海依图网络科技有限公司 | Voiceprint recognition method and system based on fusion of multiple voice features |
CN111524522A (en) * | 2020-04-23 | 2020-08-11 | 上海依图网络科技有限公司 | Voiceprint recognition method and system based on fusion of multiple voice features |
CN111488947B (en) * | 2020-04-28 | 2024-02-02 | 深圳力维智联技术有限公司 | Fault detection method and device for power system equipment |
CN111488947A (en) * | 2020-04-28 | 2020-08-04 | 深圳力维智联技术有限公司 | Fault detection method and device for power system equipment |
CN112185347A (en) * | 2020-09-27 | 2021-01-05 | 北京达佳互联信息技术有限公司 | Language identification method, language identification device, server and storage medium |
CN112242137A (en) * | 2020-10-15 | 2021-01-19 | 上海依图网络科技有限公司 | Training of human voice separation model and human voice separation method and device |
CN112242137B (en) * | 2020-10-15 | 2024-05-17 | 上海依图网络科技有限公司 | Training of human voice separation model and human voice separation method and device |
CN112259114A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voice processing method and device, computer storage medium and electronic equipment |
CN112071322A (en) * | 2020-10-30 | 2020-12-11 | 北京快鱼电子股份公司 | End-to-end voiceprint recognition method, device, storage medium and equipment |
CN112562691A (en) * | 2020-11-27 | 2021-03-26 | 平安科技(深圳)有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
CN112562656A (en) * | 2020-12-16 | 2021-03-26 | 咪咕文化科技有限公司 | Signal classification method, device, equipment and storage medium |
CN112802481A (en) * | 2021-04-06 | 2021-05-14 | 北京远鉴信息技术有限公司 | Voiceprint verification method, voiceprint recognition model training method, device and equipment |
CN113407768A (en) * | 2021-06-24 | 2021-09-17 | 深圳市声扬科技有限公司 | Voiceprint retrieval method, device, system, server and storage medium |
CN113407768B (en) * | 2021-06-24 | 2024-02-02 | 深圳市声扬科技有限公司 | Voiceprint retrieval method, voiceprint retrieval device, voiceprint retrieval system, voiceprint retrieval server and storage medium |
CN114003885A (en) * | 2021-11-01 | 2022-02-01 | 浙江大学 | Intelligent voice authentication method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020177380A1 (en) | 2020-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110010133A (en) | Voiceprint detection method, device, equipment and storage medium based on short text | |
US20200321008A1 (en) | Voiceprint recognition method and device based on memory bottleneck feature | |
Balamurali et al. | Toward robust audio spoofing detection: A detailed comparison of traditional and learned features | |
CN107610707B (en) | Voiceprint recognition method and device | |
CN111276131B (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN110378228A (en) | Face-verification video data processing method, device, computer equipment and storage medium | |
CN109346086A (en) | Voiceprint recognition method, device, computer equipment and computer-readable storage medium | |
DE69831076T2 (en) | Method and device for speech analysis and synthesis using all-pass chain filters | |
CN110232932A (en) | Speaker identification method, device, equipment and medium based on residual time-delay network | |
WO2017218465A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
CN109461073A (en) | Risk management method, device, computer equipment and storage medium based on intelligent recognition | |
CN108922544A (en) | General vector training method, voice clustering method, device, equipment and medium | |
CN109257362A (en) | Voiceprint verification method, apparatus, computer equipment and storage medium | |
CN109065027A (en) | Speech differentiation model training method, device, computer equipment and storage medium | |
CN108922543A (en) | Model library construction method, speech recognition method, device, equipment and medium | |
CN110246503A (en) | Blacklist voiceprint database construction method, device, computer equipment and storage medium | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN109448732A (en) | Digit string processing method and device | |
Karthikeyan | Adaptive boosted random forest-support vector machine based classification scheme for speaker identification | |
Alashban et al. | Speaker gender classification in mono-language and cross-language using BLSTM network | |
Mandalapu et al. | Multilingual voice impersonation dataset and evaluation | |
Pan et al. | Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection | |
CN116778910A (en) | Voice detection method | |
Mallikarjunan et al. | Text-independent speaker recognition in clean and noisy backgrounds using modified VQ-LBG algorithm | |
Choudhary et al. | Automatic speaker verification using gammatone frequency cepstral coefficients |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190712 |