CN109584884A - Speech identity feature extractor and classifier training method, and related device - Google Patents
- Publication number: CN109584884A
- Application number: CN201710910880.XA
- Authority: CN (China)
- Prior art keywords: speech, identity, voice, network model, training
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06N7/01 — Probabilistic graphical models, e.g. probabilistic networks
- G10L17/02 — Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Speaker identification or verification; training, enrolment or model building
- G10L17/14 — Speaker identification or verification; use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L17/18 — Speaker identification or verification; artificial neural networks; connectionist approaches
Abstract
Embodiments of the present invention provide a speech identity feature extractor training method, a classifier training method, and related devices. The extractor training method includes: extracting the speech feature vector of a training speech; determining the corresponding I-vector from that speech feature vector; using the I-vector as the first target output of a neural network model and adjusting the model's weights to obtain a first neural network model; obtaining the speech feature vector of a target detection speech and determining the output result of the first neural network model for it; determining an identity factor hidden variable according to that output result; and estimating the posterior mean of the identity factor hidden variable, using the posterior mean as the second target output of the first neural network model and adjusting its weights to obtain the speech identity feature extractor. Embodiments of the present invention thus make it possible to train a novel speech identity feature extractor and to extract novel speech identity features with high reliability.
Description
Technical field
The present invention relates to the field of speech technology, and in particular to a speech identity feature extractor training method, a classifier training method, and related devices.
Background art
Because speech is easy to acquire, easy to store, and difficult to imitate, it is applied in more and more identity recognition scenarios involving information security. Voice-based speaker identity recognition can be divided into two classes: speaker identification (Speaker Identification) and speaker verification (Speaker Verification). Speaker identification judges, based on a test utterance, whether the speaker belongs to a set of registered speakers; it is a one-to-many recognition problem. Speaker verification judges, based on a test utterance, whether the speaker is a particular registered target speaker; it is a one-to-one confirmation problem.
To perform speaker identity recognition from speech, a speech identity feature expressing the speaker's identity information must be extracted from the speaker's voice and then processed by a pre-trained classifier. At present, the I-vector (the identity factor, Identity-vector) is mainly used as the speech identity feature; because it reflects speaker acoustic differences, it is the speech identity feature currently in common use for speaker identity recognition. However, the inventors found that the reliability of the I-vector rests on fairly stringent requirements on the speech: when these requirements are not met, for example when the speech duration is short, the reliability of the I-vector drops greatly.
Therefore, how to provide a novel speech identity feature extractor that extracts a novel speech identity feature different from the I-vector, so as to improve the reliability of speech identity features, is a problem that those skilled in the art need to consider.
Summary of the invention
In view of this, embodiments of the present invention provide a speech identity feature extractor training method, a classifier training method, and related devices, so as to provide a novel speech identity feature extractor and realize the extraction of novel, highly reliable speech identity features; further, speaker identity recognition based on this novel speech identity feature improves the accuracy of speaker identity recognition.
To achieve the above object, embodiments of the present invention provide the following technical solutions:
A speech identity feature extractor training method, comprising:
extracting the speech feature vector of a training speech;
determining, from the speech feature vector of the training speech, the I-vector corresponding to the training speech;
using the I-vector as the first target output of a neural network model and adjusting the weights of the neural network model to obtain a first neural network model;
obtaining the speech feature vector of a target detection speech and determining the output result of the first neural network model for the speech feature vector of the target detection speech;
determining an identity factor hidden variable according to the output result; and
estimating the posterior mean of the identity factor hidden variable, using the posterior mean as the second target output of the first neural network model, and adjusting the weights of the first neural network model to obtain a speech identity feature extractor.
An embodiment of the present invention also provides a classifier training method, comprising:
obtaining the target detection speech of a target speaker;
extracting the speech feature vector of the target detection speech;
calling a pre-trained speech identity feature extractor and inputting the speech feature vector of the target detection speech into it to obtain the corresponding speech identity feature, wherein the speech identity feature extractor is trained with the identity factor hidden variable as its target output; and
training a classifier according to the speech identity feature.
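The classifier training method above can be sketched numerically. The patent does not prescribe a particular classifier form, so the following is only a minimal illustration: a stand-in "extractor" (a fixed random network averaged over frames, hypothetical) followed by a logistic-regression classifier trained by gradient descent on the resulting features.

```python
import numpy as np

def extract_f_vector(mfcc_frames, W):
    """Stand-in for the pre-trained speech identity feature extractor:
    averages per-frame network outputs into one utterance-level feature."""
    return np.tanh(mfcc_frames @ W).mean(axis=0)

def train_classifier(f_vectors, labels, lr=0.5, epochs=200):
    """Logistic-regression classifier over extracted features."""
    X = np.asarray(f_vectors)
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid scores
        grad_w = X.T @ (p - y) / len(y)         # gradient of log-loss
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
W = rng.standard_normal((39, 8))                # fixed "extractor" weights (toy)
# two speakers: utterances drawn around different frame means
utts = [rng.standard_normal((50, 39)) + (0.5 if k else -0.5)
        for k in (0, 1) for _ in range(20)]
labels = [k for k in (0, 1) for _ in range(20)]
F = [extract_f_vector(u, W) for u in utts]
w, b = train_classifier(F, labels)
preds = [(1.0 / (1.0 + np.exp(-(f @ w + b)))) > 0.5 for f in F]
acc = float(np.mean([p == l for p, l in zip(preds, labels)]))
```

In practice the extractor would be the trained network described below rather than random weights, and the classifier could equally be a PLDA or other back-end.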
An embodiment of the present invention also provides a speech identity feature extractor training device, comprising:
a first speech feature vector extraction module, configured to extract the speech feature vector of a training speech;
an identity factor determining module, configured to determine, from the speech feature vector of the training speech, the I-vector corresponding to the training speech;
a first training module, configured to use the I-vector as the first target output of a neural network model and adjust the weights of the neural network model to obtain a first neural network model;
a first result determining module, configured to obtain the speech feature vector of a target detection speech and determine the output result of the first neural network model for the speech feature vector of the target detection speech;
a hidden variable determining module, configured to determine an identity factor hidden variable according to the output result; and
a second training module, configured to estimate the posterior mean of the identity factor hidden variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain a speech identity feature extractor.
An embodiment of the present invention also provides an electronic device, comprising at least one memory and at least one processor, the memory storing an executable program, the program being used to:
extract the speech feature vector of a training speech;
determine, from the speech feature vector of the training speech, the I-vector corresponding to the training speech;
use the I-vector as the first target output of a neural network model and adjust the weights of the neural network model to obtain a first neural network model;
obtain the speech feature vector of a target detection speech and determine the output result of the first neural network model for the speech feature vector of the target detection speech;
determine an identity factor hidden variable according to the output result; and
estimate the posterior mean of the identity factor hidden variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain a speech identity feature extractor.
An embodiment of the present invention also provides a classifier training device, comprising:
a target detection speech obtaining module, configured to obtain the target detection speech of a target speaker;
a second speech feature vector extraction module, configured to extract the speech feature vector of the target detection speech;
a speech identity feature extraction module, configured to call a pre-trained speech identity feature extractor and input the speech feature vector of the target detection speech into it to obtain the corresponding speech identity feature, wherein the speech identity feature extractor is trained with the identity factor hidden variable as its target output; and
a training module, configured to train a classifier according to the speech identity feature.
An embodiment of the present invention also provides an electronic device, comprising at least one memory and at least one processor, the memory storing an executable program, the program being used to:
obtain the target detection speech of a target speaker;
extract the speech feature vector of the target detection speech;
call a pre-trained speech identity feature extractor and input the speech feature vector of the target detection speech into it to obtain the corresponding speech identity feature, wherein the speech identity feature extractor is trained with the identity factor hidden variable as its target output; and
train a classifier according to the speech identity feature.
Based on the above technical solutions, the speech identity feature extractor training method provided by embodiments of the present invention includes: extracting the speech feature vector of a training speech; determining the corresponding I-vector from that speech feature vector; using the I-vector as the first target output of a neural network model and adjusting the model's weights to obtain a first neural network model; then obtaining the speech feature vector of a target detection speech and determining the first neural network model's output result for it, so that an identity factor hidden variable can be determined according to the output result; and estimating the posterior mean of the identity factor hidden variable, using the posterior mean as the second target output of the neural network model and adjusting its weights to obtain a speech identity feature extractor, thereby realizing the training of a novel speech identity feature extractor.
The training method provided by embodiments of the present invention is based on a neural network model and takes as its training target the posterior mean of the identity factor hidden variable, which contains more compact speaker information and has high reliability. The speech identity features extracted by the resulting extractor therefore have higher reliability, and the requirements on the speech can be relaxed. The training method thus makes it possible to obtain a novel speech identity feature extractor and to extract novel speech identity features with high reliability.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a speech identity feature extractor training method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the preprocessing applied to a training speech;
Fig. 3 is a flowchart of a method for determining the I-vector corresponding to a training speech;
Fig. 4 is a schematic diagram of layer-wise initialization of a neural network model;
Fig. 5 is a flowchart of a method for training the first neural network model;
Fig. 6 is a schematic diagram of the process of training the first neural network model;
Fig. 7 is a flowchart of a method for training the speech identity feature extractor from the first neural network model;
Fig. 8 is a schematic diagram of the process of training the speech identity feature extractor;
Fig. 9 is a schematic diagram of the process of training an F-vector extractor on a layer-wise-initialized neural network model;
Fig. 10 is a flowchart of a classifier training method provided by an embodiment of the present invention;
Fig. 11 is a flowchart of a method for training a classifier according to the speech identity feature;
Fig. 12 is a simplified schematic diagram of training the extractor and classifier in an embodiment of the present invention;
Fig. 13 is a structural block diagram of a speech identity feature extractor training device provided by an embodiment of the present invention;
Fig. 14 is another structural block diagram of the speech identity feature extractor training device provided by an embodiment of the present invention;
Fig. 15 is a hardware block diagram of an electronic device;
Fig. 16 is a structural block diagram of a classifier training device provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a speech identity feature extractor training method provided by an embodiment of the present invention. With this method, the novel speech identity feature extractor provided by the embodiment of the present invention can be trained; based on this extractor, a more reliable speech identity feature different from the I-vector can be extracted from speech.
The method shown in Fig. 1 can be applied to an electronic device with data processing capability. The electronic device may be a server on the network side or user equipment on the user side, such as a mobile phone or a PC (personal computer), depending on the actual requirements. The embodiment of the present invention can load a program corresponding to the speech identity feature extractor training method in the electronic device to execute the method.
Referring to Fig. 1, the speech identity feature extractor training method provided by the embodiment of the present invention may include:
Step S100: extract the speech feature vector of a training speech.
Optionally, the training speech may be obtained from a preset training speech set; the embodiment of the present invention may collect speech segments in advance and record them in the training speech set, each collected segment being regarded as one training speech.
Optionally, the speech feature vector may be an MFCC (Mel Frequency Cepstral Coefficient) feature; the speech feature vector of a training speech can then be obtained by extracting its MFCC features.
Optionally, the embodiment of the present invention may preprocess the training speech to extract its speech feature vector. As an optional implementation, referring to Fig. 2, the preprocessing may include, in sequence: voice activity detection (VAD), pre-emphasis, framing with a Hamming window, FFT (Fast Fourier Transform), Mel filtering, Log (logarithm), DCT (discrete cosine transform), CMVN (cepstral mean and variance normalization), Δ (first-order difference) and ΔΔ (second-order difference) processing.
Optionally, the speech feature vector of a training speech may be composed of the speech feature vectors of its individual frames; further, the per-frame speech feature vectors can be collected into the speech feature vector sequence of the training speech. For example, the speech feature vector sequence of the i-th training speech can be expressed as X^i = {x_1^i, x_2^i, ..., x_T^i}, where x_t^i denotes the speech feature vector of the t-th frame of the i-th training speech.
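The core of the preprocessing chain of Fig. 2 can be sketched numerically. The following is a minimal numpy sketch of the framing → Hamming window → FFT → Mel filterbank → log → DCT steps; the window length, hop, filter count, and cepstral order are illustrative assumptions, not values fixed by the patent, and VAD, CMVN, and the Δ/ΔΔ features are omitted.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filters over the FFT bins."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / (c - l)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / (r - c)   # falling edge
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Pre-emphasis -> framing + Hamming -> FFT power -> mel -> log -> DCT."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sr)
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II over the filter axis keeps the first n_ceps cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return logmel @ dct.T

rng = np.random.default_rng(0)
feats = mfcc(rng.standard_normal(16000))   # 1 s of noise at 16 kHz -> (frames, 13)
```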
Step S110: determine, from the speech feature vector of the training speech, the I-vector corresponding to the training speech.
After the speech feature vector of the training speech is extracted, the embodiment of the present invention can process it based on a GMM (Gaussian mixture model) and determine the I-vector (identity factor) corresponding to the training speech.
Because the reliability of the I-vector rests on fairly stringent requirements such as speech duration, its reliability is low for short speech (speech whose duration falls below some threshold, e.g. 10 seconds). Therefore, after determining the I-vector, the embodiment of the present invention does not directly use it as the speech identity feature for speaker identity recognition, but uses it further in the subsequent training of the novel speech identity feature extractor.
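Under the standard total-variability formulation the GMM supervector mean is modeled as M = m + T w, and the I-vector is the posterior mean of w given the zeroth- and first-order Baum-Welch statistics collected against the GMM. A schematic numpy version, with illustrative sizes and a toy noiseless check (the patent does not detail this computation; this is the commonly used closed form, assuming diagonal GMM covariances):

```python
import numpy as np

def i_vector(N, F, T, sigma_inv):
    """Posterior mean of w in the total-variability model M = m + T w.
    N: (C,) zeroth-order Baum-Welch stats; F: (C*D,) centered first-order
    stats; T: (C*D, R) total-variability matrix; sigma_inv: (C*D,) diagonal
    precision of the GMM covariances."""
    C, R = len(N), T.shape[1]
    D = len(F) // C
    n_sig = np.repeat(N, D) * sigma_inv                 # N Sigma^-1 per feature dim
    precision = np.eye(R) + T.T @ (n_sig[:, None] * T)  # I + T' Sigma^-1 N T
    return np.linalg.solve(precision, T.T @ (sigma_inv * F))

rng = np.random.default_rng(0)
C, D, R = 8, 13, 10                    # GMM components, feature dim, I-vector rank
T = 0.1 * rng.standard_normal((C * D, R))
sigma_inv = np.ones(C * D)
w_true = rng.standard_normal(R)
N = np.full(C, 50.0)                   # 50 effective frames per component
F = np.repeat(N, D) * (T @ w_true)     # noiseless centered stats for this check
w_hat = i_vector(N, F, T, sigma_inv)   # close to w_true, slightly shrunk by the prior
```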
Step S120: use the I-vector as the first target output of a neural network model and adjust the weights of the neural network model to obtain a first neural network model.
The speech identity feature extractor provided by the embodiment of the present invention can be trained based on a neural network model, such as a DNN (Deep Neural Network) model; other forms of neural network model, such as a CNN (convolutional neural network), are not excluded.
The embodiment of the present invention can use the I-vector corresponding to the training speech as the first target output of the neural network model and adjust the weights of the neural network model so that the output of the model corresponds to the first target output, yielding the adjusted first neural network model. Optionally, during this process, the mean squared error between each output of the neural network model and the first target output can be used as the loss function to supervise the weight adjustment, so that the output of the neural network model finally tends to the first target output (i.e., the I-vector corresponding to the training speech), thus obtaining the first neural network model.
Optionally, the input used when adjusting the weights of the neural network model can be determined from the speech feature vector of the training speech: the embodiment of the present invention can determine an input speech feature vector from the speech feature vector of the training speech, use it as the input of the neural network model, use the I-vector as the first target output, and adjust the weights of the neural network model.
Optionally, once the input and the first target output of the neural network model are defined, there are many ways to adjust the weights so that the output tends to the first target output; for example, the error backpropagation algorithm can be used. The embodiment of the present invention places no restriction on the specific weight adjustment means once the input and the first target output are defined.
Optionally, the input speech feature vector (the input of the neural network model) can be obtained from the per-frame speech feature vectors of the training speech. In an optional implementation, the embodiment of the present invention can splice the speech feature vectors of a set number of adjacent frames of the training speech to obtain the input speech feature vector, for example splicing the MFCC features of 9 adjacent frames (this value is only illustrative). Obviously, this way of determining the input speech feature vector is only optional; the embodiment of the present invention may also extract multi-frame speech feature vectors from the per-frame speech feature vectors and splice them to obtain the input speech feature vector.
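The frame-splicing just described (concatenating a context window of adjacent frames, e.g. the illustrative 9 frames) can be sketched as follows. The edge handling (repeating the first/last frame) is an assumption of this sketch; the patent does not specify it.

```python
import numpy as np

def splice_frames(feats, context=4):
    """Concatenate each frame with its +/-context neighbours (9 frames
    for context=4); edges are padded by repeating the first/last frame."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

mfcc_feats = np.arange(20.0).reshape(5, 4)     # 5 frames of 4-dim features
spliced = splice_frames(mfcc_feats, context=4)  # -> (5, 36): 9 frames x 4 dims
```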
Optionally, before adjusting the weights of the neural network model, the embodiment of the present invention may also initialize the neural network model, for example using a layer-wise initialization method (e.g., for a DNN model), and then adjust the weights on the basis of the layer-wise-initialized neural network model.
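The first training stage — spliced frame features as input, the utterance's I-vector as regression target, mean squared error as the supervising loss — can be sketched with a tiny two-layer network trained by backpropagation. All sizes, the learning rate, and the toy data construction are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, H, D_out = 36, 64, 10      # spliced-frame input, hidden units, I-vector size

# Toy data: each utterance's frames carry a noisy linear trace of its I-vector.
A = rng.standard_normal((D_out, D_in)) / np.sqrt(D_in)
ivecs = rng.standard_normal((30, D_out))                      # one I-vector per utterance
X = np.vstack([v @ A + 0.05 * rng.standard_normal((20, D_in)) for v in ivecs])
Y = np.repeat(ivecs, 20, axis=0)                              # frame-level targets

W1 = 0.1 * rng.standard_normal((D_in, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, D_out)); b2 = np.zeros(D_out)

losses, lr = [], 0.1
for _ in range(300):
    h = np.tanh(X @ W1 + b1)                 # hidden layer
    out = h @ W2 + b2                        # network output, compared to the I-vector
    err = out - Y
    losses.append(float((err ** 2).mean()))  # MSE loss supervising the weights
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)         # backpropagate through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```

The loss decreasing toward the I-vector targets is exactly the criterion the patent describes for obtaining the first neural network model.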
Step S130: obtain the speech feature vector of a target detection speech and determine the output result of the first neural network model for the speech feature vector of the target detection speech.
After the first neural network model is trained, the embodiment of the present invention can obtain a target detection speech, extract its speech feature vector (such as MFCC features), use that speech feature vector as the input of the first neural network model, and determine the corresponding output result of the first neural network model (i.e., obtain the output result of the first neural network model for the speech feature vector of the target detection speech).
Step S140: determine an identity factor hidden variable according to the output result.
Optionally, after obtaining the output result of the first neural network model for the speech feature vector of the target detection speech, the embodiment of the present invention can determine the mean of the output result and, using this mean in the training of an SNR (SIGNAL NOISE RATIO, signal-to-noise ratio)-invariant PLDA (Probabilistic Linear Discriminant Analysis) model, determine the identity factor (I-vector) hidden variable.
It should be noted that "hidden variable" is a technical term of the mathematical theory of factor analysis and can be understood as a latent variable.
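In a plain Gaussian PLDA model the observation is x = mu + V h + eps, where h is the speaker's latent identity variable, and its posterior mean given n observations has the closed form E[h | x_1..x_n] = (I + n V' Sigma^-1 V)^-1 V' Sigma^-1 sum_i (x_i - mu). The sketch below illustrates this under a diagonal-noise Gaussian PLDA assumption; the SNR-invariant PLDA used by the patent adds noise-condition terms not shown here.

```python
import numpy as np

def plda_posterior_mean(X, mu, V, sigma_inv):
    """E[h | x_1..x_n] for the PLDA model x = mu + V h + eps,
    eps ~ N(0, Sigma), with diagonal Sigma (sigma_inv = 1/diag(Sigma))."""
    n = len(X)
    VtS = V.T * sigma_inv                              # V' Sigma^-1
    precision = np.eye(V.shape[1]) + n * (VtS @ V)     # I + n V' Sigma^-1 V
    return np.linalg.solve(precision, VtS @ (X - mu).sum(axis=0))

rng = np.random.default_rng(0)
D, R = 20, 5                                  # observed dim, latent identity dim
V = rng.standard_normal((D, R)) / np.sqrt(D)  # speaker loading matrix
mu = np.zeros(D)
h_true = rng.standard_normal(R)               # the speaker's identity hidden variable
X = mu + h_true @ V.T + 0.01 * rng.standard_normal((40, D))  # 40 observations, one speaker
h_hat = plda_posterior_mean(X, mu, V, np.full(D, 1.0 / 0.01 ** 2))
```

This posterior mean is the quantity that step S150 then uses as the second target output.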
Step S150, the Posterior Mean for estimating identity factor hidden variable, using the Posterior Mean as the first nerves
Second target of network model exports, and adjusts the weight of the first nerves network model, obtains speech identity feature extractor.
After the identity factor hidden variable (i.e., the hidden variable of the I-vector) is obtained, this hidden variable contains more compact speaker information and has higher reliability; therefore, the embodiment of the present invention may use the posterior mean of the identity factor hidden variable as the second target output for training the first neural network model, and adjust the weights of the first neural network model so that its output tends toward the second target output. Once training is complete, the speech identity feature extractor is obtained.
It should be noted that the posterior mean is a term from probability theory.
Optionally, the input used when adjusting the weights of the first neural network model in step S150 may be determined from the speech feature vector of the target detection voice; for example, the speech feature vectors of a set number of adjacent frames of the target detection voice may be spliced (this is merely an optional example) to form the input used when adjusting the weights of the first neural network model. The target detection voice may be speech uttered by the target speaker (the target speaker may be regarded as the legitimate speaker to be registered).
Since the embodiment of the present invention trains the speech identity feature extractor with the highly reliable identity factor hidden variable, which contains more compact speaker information, as the target, the speech identity feature extracted by this extractor has higher reliability, and the extraction of a novel, highly reliable speech identity feature can be achieved. To distinguish it from the existing I-vector, the speech identity feature extractor obtained by the training of the embodiment of the present invention may be called an F-vector extractor, and the speech identity feature extracted by this extractor may be called an F-vector.
The speech identity feature extractor training method provided by the embodiment of the present invention includes: extracting the speech feature vector of the training voice; determining the I-vector corresponding to the training voice according to the speech feature vector of the training voice; using the I-vector as the first target output of the neural network model and adjusting the weights of the neural network model to obtain the first neural network model; after the first neural network model is obtained, acquiring the speech feature vector of the target detection voice and determining the output result of the first neural network model for the speech feature vector of the target detection voice, so as to determine the identity factor hidden variable according to the output result; and estimating the posterior mean of the identity factor hidden variable, using the posterior mean as the second target output of the neural network model, and adjusting the weights of the neural network model to obtain the speech identity feature extractor, thereby realizing the training of a novel speech identity feature extractor.
The speech identity feature extractor training method provided by the embodiment of the present invention is based on a neural network model and takes as its target the posterior mean of the identity factor hidden variable, which contains more compact speaker information and has high reliability. The speech identity feature extractor obtained by such training makes the extracted speech identity feature more reliable and reduces the requirements on the voice. The training method provided by the embodiment of the present invention can thus train a novel speech identity feature extractor, making the extraction of a highly reliable novel speech identity feature possible.
To better understand the defects of the I-vector, and to illustrate the I-vector determination method provided by the embodiment of the present invention, Fig. 3 shows the method flow of determining the I-vector corresponding to the training voice. Referring to Fig. 3, the method may include:
Step S200: based on a GMM model, determine the sufficient statistics according to the speech feature vector of each frame of the training voice.
The speech feature vector of the training voice may consist of the speech feature vectors of each frame of the training voice; the per-frame speech feature vectors together form the speech feature vector sequence of the training voice.
Optionally, let the speech feature vector sequence of the i-th training voice be $\{y_t^{(i)}\}$, where $y_t^{(i)}$ denotes the t-th frame speech feature vector of the i-th training voice.
Based on a GMM model with K mixtures, the sufficient statistics may then be determined according to the following formulas:
$$N_k^{(i)} = \sum_t \gamma_t(k)$$
denotes the 0th-order sufficient statistic, where $\gamma_t(k)$ denotes the occupancy of the t-th frame speech feature vector on the k-th mixture;
$$F_k^{(i)} = \sum_t \gamma_t(k)\, y_t^{(i)}$$
denotes the 1st-order sufficient statistic.
Here the GMM model may be represented by $\lambda = \{w_k, m_k, \Sigma_k\}_{k=1}^{K}$, where the lower-case k indexes the mixtures (the order of the GMM model), w denotes the mixture weight, m denotes the mean, and $\Sigma$ denotes the covariance.
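As an illustrative sketch (not part of the patent text), the 0th- and 1st-order sufficient statistics above may be computed as follows, assuming a diagonal-covariance GMM; all function and variable names here are hypothetical:

```python
import numpy as np

def gmm_sufficient_stats(frames, weights, means, covs):
    """Compute the 0th/1st-order sufficient statistics of an utterance
    under a diagonal-covariance GMM (a simplified sketch).

    frames : (T, D) per-frame feature vectors (e.g. MFCCs)
    weights: (K,)   mixture weights
    means  : (K, D) mixture means
    covs   : (K, D) diagonal covariances
    """
    # log N(y_t | m_k, Sigma_k) for every frame/mixture pair
    diff = frames[:, None, :] - means[None, :, :]              # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / covs, axis=2)
                        + np.sum(np.log(2 * np.pi * covs), axis=1))
    log_post = np.log(weights) + log_gauss                     # (T, K)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                # occupancies gamma_t(k)
    N = gamma.sum(axis=0)                   # (K,)   0th-order statistics
    F = gamma.T @ frames                    # (K, D) 1st-order statistics
    return N, F
```

Because the occupancies of each frame sum to one, the 0th-order statistics sum to the number of frames T, which is what makes them a proxy for utterance duration in the reliability discussion below.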
Step S210: determine the total variability space matrix according to the sufficient statistics.
After the sufficient statistics are determined, the total variability space matrix used by the I-vector extraction algorithm (denoted T) may be determined based on the sufficient statistics. Optionally, the EM (Expectation Maximization) algorithm may be used to iteratively solve for the total variability space matrix from the sufficient statistics; the EM algorithm may be regarded as an iterative method for solving a class of special maximum likelihood (Maximum Likelihood) problems.
Step S220: determine the I-vector corresponding to the training voice according to the total variability space matrix.
After the total variability space matrix is obtained, the I-vector corresponding to the training voice may be calculated from the total variability space matrix, the diagonal block matrix formed from the 0th-order sufficient statistics, and the result of splicing the 1st-order sufficient statistics.
Optionally, the formula used to determine the I-vector may be as follows:
$$w = (I + \mathbf{T}^{\mathrm T} \Sigma^{-1} N \mathbf{T})^{-1}\, \mathbf{T}^{\mathrm T} \Sigma^{-1} F$$
where I denotes the identity matrix, bold $\mathbf{T}$ denotes the total variability space matrix, the superscript T denotes matrix transposition, $N$ denotes the diagonal block matrix whose diagonal blocks are spliced from the 0th-order sufficient statistics $N_k$, $F$ denotes the supervector obtained by splicing the 1st-order sufficient statistics, and $\Sigma$ denotes the diagonal matrix whose diagonal elements are composed of the elements of the diagonal covariance matrices of the individual mixtures of the GMM model.
Optionally, after the I-vector is obtained, its posterior covariance may be expressed as $(I + \mathbf{T}^{\mathrm T} \Sigma^{-1} N \mathbf{T})^{-1}$. It can be seen that the shorter the voice duration, the smaller the values of the corresponding 0th-order statistics, the larger the posterior covariance, and the less reliable the estimated I-vector. This shows that the reliability of the I-vector places a relatively high requirement on the duration of the voice; with short voice, the I-vector easily becomes unreliable.
After the I-vector corresponding to the training voice is obtained, the embodiment of the present invention may use the I-vector as the first target output and adjust the weights of a neural network model in a form such as DNN, realizing the first training of the neural network model and obtaining the first neural network model; then, based on the first neural network model, with the posterior mean of the identity factor hidden variable as the second target output, the weights of the first neural network model are adjusted to obtain the speech identity feature extractor.
Optionally, the neural network model used in the embodiment of the present invention may be a DNN model, a CNN model, or the like; that is, the neural network model used to train the first neural network model may be a DNN model, a CNN model, etc., and correspondingly the first neural network model may also be a DNN model, a CNN model, etc.
It should be noted that the DNN model is a deep learning framework model whose structure mainly includes one input layer, multiple hidden layers, and one output layer; generally, the first layer of the DNN model is the input layer, the last layer is the output layer, the layers in between are the hidden layers, and adjacent layers of the DNN model are fully connected.
Optionally, taking the DNN model as an example, in the process of adjusting the weights (i.e., parameters) of the DNN model with the I-vector as the first target output to obtain a first DNN model (one form of the first neural network model), the embodiment of the present invention may use the error backpropagation algorithm or the like (other DNN weight adjustment methods may of course also be used) to adjust the weights of the DNN model so that the output of the adjusted DNN model tends toward the first target output, thereby obtaining the first DNN model. The weights of the DNN model adjusted in this process mainly include the weights of the linear transformations connecting the layers of the DNN model (e.g., between the input layer and a hidden layer, between the hidden layers, and between a hidden layer and the output layer).
Correspondingly, in the process of adjusting the weights of the first DNN model with the posterior mean of the identity factor hidden variable as the second target output to obtain the speech identity feature extractor, the embodiment of the present invention may also use the error backpropagation algorithm or the like to adjust the weights of the first DNN model so that the output of the adjusted first DNN model tends toward the second target output, thereby obtaining the speech identity feature extractor; the weights of the first DNN model adjusted in this process may likewise include the weights of the linear transformations connecting the layers of the DNN model.
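The weight adjustment described above can be sketched with a toy one-hidden-layer network trained by error backpropagation under a mean-square-error loss toward a fixed target vector (standing in for the I-vector or posterior-mean target). This is a minimal illustration, not the patent's implementation; dimensions and names are assumptions:

```python
import numpy as np

def train_regression_dnn(x, target, hidden=16, steps=300, lr=0.05, seed=0):
    """Train a one-hidden-layer network by backpropagation so that its
    output tends toward `target` (mean-square-error loss); returns the
    network output after training."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(x.size, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, target.size)); b2 = np.zeros(target.size)
    for _ in range(steps):
        h = np.tanh(x @ W1 + b1)                    # hidden layer
        y = h @ W2 + b2                             # linear output layer
        grad_y = 2.0 * (y - target) / target.size   # d(MSE)/dy
        grad_h = (grad_y @ W2.T) * (1.0 - h ** 2)   # error backpropagated to hidden layer
        W2 -= lr * np.outer(h, grad_y); b2 -= lr * grad_y
        W1 -= lr * np.outer(x, grad_h); b1 -= lr * grad_h
    return np.tanh(x @ W1 + b1) @ W2 + b2
```

The adjusted quantities are exactly the inter-layer linear-transformation weights and biases named in the text; iterating the update drives the output toward the target, i.e., toward "training convergence".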
Taking a CNN model as an example, the structure of a CNN model mainly includes an input layer, convolutional layers, pooling layers, and a fully connected layer, where there may be multiple convolutional and pooling layers. Optionally, in the process of adjusting the weights (i.e., parameters) of the CNN model with the I-vector as the first target output to obtain a first CNN model (one form of the first neural network model), the embodiment of the present invention may use the error backpropagation algorithm or the like (other CNN weight adjustment methods may of course also be used) to adjust the weights of the CNN model so that the output of the adjusted CNN model tends toward the first target output, thereby obtaining the first CNN model; the weights of the CNN model adjusted in this process may include elements of the model parameters of the CNN model such as the bias matrices of the convolutional layers, the weight matrix of the fully connected layer, and the bias vector of the fully connected layer.
Correspondingly, in the process of adjusting the weights of the first CNN model with the posterior mean of the identity factor hidden variable as the second target output to obtain the speech identity feature extractor, the embodiment of the present invention may also use the error backpropagation algorithm or the like to adjust the weights of the first CNN model so that the output of the adjusted first CNN model tends toward the second target output; the weights of the first CNN model adjusted in this process may likewise include elements of the model parameters of the CNN model such as the initial bias matrices of the convolutional layers, the initial weight matrix of the fully connected layer, and the initial bias vector of the fully connected layer.
Obviously, the above structures and weight adjustment means of the neural network model are merely optional; given the input and the target output of the neural network model, the embodiment of the present invention may employ any weight adjustment means that makes the output of the neural network model tend toward the target output. The weight adjustment of the neural network model may be an iterative adjustment process: the weights of the neural network model are adjusted iteratively so that the output of the neural network model tends toward the target output.
Optionally, in one optional implementation, the embodiment of the present invention may first initialize the neural network model with a layer-wise initialization method to obtain the neural network structure shown in Fig. 4, and on this basis carry out the training of the first neural network model.
Taking a neural network model in DNN form as an example, Fig. 5 shows the method flow of training to obtain the first neural network model. Referring to Fig. 5, the method may include:
Step S300: initialize the DNN model with a layer-wise initialization method.
Step S310: splice the speech feature vectors of a set number of adjacent frames of the training voice to obtain the input speech feature vector.
Step S320: take the input speech feature vector as the input of the DNN model and the I-vector as the first target output of the DNN model, use the mean square error between each output of the DNN model and the first target output as the loss function, and adjust the weights of the DNN model to obtain the first DNN model.
Optionally, as an example, as shown in Fig. 6, the embodiment of the present invention may splice the speech feature vectors of 9 adjacent frames of the training voice as the input of the DNN model, take the mean square error between each output of the DNN model and the first target output as the loss function, and iteratively adjust the weights of the DNN model until the output of the DNN model tends toward the first target output and the training convergence condition is reached, obtaining the first DNN model.
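The 9-frame splicing in the example above (each frame concatenated with its 4 left and 4 right neighbours) can be sketched as follows; the edge-padding strategy is an assumption, since the patent does not specify one:

```python
import numpy as np

def splice_frames(features, context=4):
    """Splice each frame with its `context` left and right neighbours
    (2*context + 1 = 9 frames for context=4), repeating the boundary
    frames as padding at the utterance edges.

    features: (T, D) per-frame feature matrix (e.g. MFCCs)
    returns : (T, (2*context + 1) * D) spliced input vectors
    """
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], context, axis=0),
                             features,
                             np.repeat(features[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])
```

Each row of the result is one input vector of the DNN, with the current frame sitting in the centre slice.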
After the training of the first neural network model is completed, the determination of the identity factor hidden variable may be realized based on the target detection voice. Optionally, the output results of the first neural network model corresponding to the speech feature vectors of the target detection voice may be calculated; for instance, given the speech feature vector of the i-th voice of the s-th speaker, the corresponding output results of the first neural network model can be calculated. Then the mean of the output results (denoted Vsi) is determined, and the SNR-invariant PLDA model is trained with this mean as training data; the identity factor hidden variable can be calculated in the training process.
Optionally, training the SNR-invariant PLDA model may be realized according to the following formula:
$$V_{si} = m + R h_i + U g_b + \epsilon_{si}$$
where b denotes the signal-to-noise-ratio interval corresponding to the target detection voice, m denotes the mean, R denotes the speaker information space, U denotes the signal-to-noise-ratio space, $h_i$ denotes the identity factor hidden variable, $g_b$ denotes the noise-specific factor, and $\epsilon_{si}$ denotes the residual term.
During the training of the SNR-invariant PLDA model, after the identity factor hidden variable is determined, the posterior mean of the identity factor hidden variable may be estimated. This posterior mean contains more compact speaker information, so it may be used as the target output for adjusting the weights of the first neural network model, and the F-vector extractor is obtained by training (i.e., the first neural network model is trained with the posterior mean as the target output, and the model obtained after training converges is the F-vector extractor).
Optionally, taking a neural network model in DNN form as an example, Fig. 7 shows the method flow of training the speech identity feature extractor (F-vector extractor) based on the first neural network model. Referring to Fig. 7, the method may include:
Step S400: determine the input of the first DNN model according to the speech feature vector of the target detection voice.
Optionally, the speech feature vectors of a set number of adjacent frames of the target detection voice may be spliced to obtain the input of the first DNN model.
Step S410: take the posterior mean of the identity factor hidden variable as the second target output of the first DNN model, use the mean square error between each output of the first DNN model and the second target output as the loss function, and adjust the first DNN model to obtain the speech identity feature extractor.
Optionally, as an example, as shown in Fig. 8, the embodiment of the present invention may splice the speech feature vectors of a set number of adjacent frames of the target detection voice as the input of the first DNN model, take the mean square error between each output of the first DNN model and the second target output as the loss function, and iteratively adjust the weights of the first DNN model until the output of the first DNN model tends toward the second target output and the training convergence condition is reached, obtaining the speech identity feature extractor (F-vector extractor).
Optionally, on the basis of initializing the DNN model with the layer-wise initialization method, the training process of the F-vector extractor may be as shown in Fig. 9 for reference, where w1 denotes the first dimension of the I-vector and wn denotes the n-th dimension of the I-vector.
The training method provided by the embodiment of the present invention is based on a neural network model and takes as its target the posterior mean of the identity factor hidden variable, which contains more compact speaker information and has high reliability; training yields a novel speech identity feature extractor, so that the extraction of a highly reliable novel speech identity feature can be achieved, providing a higher accuracy guarantee for subsequent speaker identification based on the speech identity feature.
On the basis of the speech identity feature extractor obtained by the above training, the embodiment of the present invention may, based on the speech identity feature extractor, realize the training of a classifier for recognizing different speakers; the classifier may be trained based on the voice of a predetermined speaker (e.g., a speaker to be registered).
Optionally, Fig. 10 shows the flow of the classifier training method provided by the embodiment of the present invention. Referring to Fig. 10, the method may include:
Step S500: obtain the target detection voice of the target speaker.
The embodiment of the present invention places low requirements on the target detection voice, whose duration may be arbitrary. The target detection voice of the target speaker may be the voice of the legitimate speaker to be registered. The embodiment of the present invention may realize the training of the classifier for the target speaker based on a speaker verification scenario (a one-to-one identity confirmation problem); subsequently, the trained classifier recognizes the voice of the target speaker, realizing higher-precision speaker verification.
Step S510: extract the speech feature vector of the target detection voice.
Optionally, the embodiment of the present invention may extract the MFCC feature of the target detection voice.
Step S520: call the pre-trained speech identity feature extractor, input the speech feature vector of the target detection voice into the speech identity feature extractor, and obtain the corresponding speech identity feature.
As described above, the speech identity feature extractor is trained with the identity factor hidden variable as the target output. On the basis of the trained speech identity feature extractor (F-vector extractor), the embodiment of the present invention may take the speech feature vector of the target detection voice as the input of the F-vector extractor, and the F-vector extractor correspondingly outputs the speech identity feature (F-vector).
For example, for the i-th voice of speaker s, its MFCC feature may be extracted and used as the input of the F-vector extractor to obtain the corresponding F-vector.
Step S530: train the classifier according to the speech identity feature.
After the speech identity feature is obtained, the mean of the speech identity feature may be determined, and the classifier is obtained by training with this mean.
Optionally, the classifier obtained by the training of the embodiment of the present invention may be used in the text-independent speaker verification scenario. As described above, voice-based speaker identity recognition can be divided into two classes, speaker identification (Speaker Identification) and speaker verification (Speaker Verification); in terms of the requirements on the voice, voice-based speaker identity recognition can also be divided into text-dependent (Text-dependent) and text-independent (Text-independent) classes. Text-dependent means that the voice to be tested, as spoken by the speaker, must have the same semantic content as the registration voice, which applies to occasions where the speaker cooperates; text-independent means that the semantic content of the voice is not of concern, so there are fewer limiting factors and the application is more flexible and widespread.
It should be explained that, since text-independent speaker identity recognition places no restriction on the semantic content of the voice, mismatches between training and testing voices will appear; therefore, to obtain good recognition performance, a large amount of training voice is usually needed. The classifier provided by the embodiment of the present invention, however, is trained on the novel speech identity feature with lower requirements on the voice, so it can greatly reduce the decline of the recognition accuracy of the classifier as the duration of the voice shortens, making accurate speaker identity recognition possible.
Optionally, the classifier provided by the embodiment of the present invention may be a PLDA (probabilistic linear discriminant analysis) classifier. One optional process of training the classifier according to the speech identity feature may be as shown in Fig. 11, comprising:
Step S600: determine the mean of the speech identity feature.
Assuming the speech identity feature has been extracted from the i-th voice of speaker s, the mean ysi of the speech identity feature may be determined.
Step S610: perform within-class covariance normalization and L2 norm normalization on the mean of the speech identity feature to obtain the processed feature, and train the classifier with the processed feature.
Optionally, after within-class covariance normalization and L2 norm normalization are applied to the mean ysi of the speech identity feature, the processed feature may be used as training data to obtain the PLDA classifier by training.
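The two normalization steps above can be sketched as follows, assuming the common formulation of within-class covariance normalization (WCCN, projecting with a Cholesky factor of the inverse within-class covariance) followed by length normalization; the patent does not fix these details, so this is illustrative:

```python
import numpy as np

def wccn_l2_normalize(feats, labels):
    """Within-class covariance normalization followed by L2 normalization.

    feats : (N, D) one speech identity feature (mean) per utterance
    labels: (N,)   speaker label of each utterance
    """
    D = feats.shape[1]
    W = np.zeros((D, D))
    classes = np.unique(labels)
    for c in classes:
        x = feats[labels == c]
        d = x - x.mean(axis=0)
        W += d.T @ d / len(x)          # per-speaker within-class scatter
    W /= len(classes)
    # WCCN projection: Cholesky factor B with B B^T = W^{-1}
    B = np.linalg.cholesky(np.linalg.inv(W + 1e-6 * np.eye(D)))
    projected = feats @ B
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)
```

After this step every training feature has unit L2 norm, and within-speaker variability is whitened before the PLDA classifier is trained.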
Optionally, in order to effectively embed a nonparametric discriminant analysis algorithm into the PLDA classifier and improve the training precision of the PLDA classifier, the embodiment of the present invention may use a nonparametric PLDA model based on the following two covariance matrices (i.e., the within-class covariance matrix and the nonparametric between-class covariance matrix below):
(1) covariance matrix in class, calculation can be such that
Wherein, S (capitalization) indicates speaker's number, and s (small letter) indicates s-th of speaker, HsIndicate s-th of speaker's
Voice strip number, usFor the mean value of s-th of speaker.
(2) covariance matrix between nonparametric class can be used following formula and calculate:
Wherein,Indicate in the feature for illustrating people k with feature ysiQ-th of feature of arest neighbors, Q are neighbour
The sum of feature, mk(ysi) mean value of Q neighbour's feature is represented, g (s, k, i) represents a weighting function, is defined as follows:
Wherein, index parameters α is the metric function d (y that adjusts the distance1, y2) weighting adjust, d (y1, y2) refer to feature y1And y2
Between Euclidean distance measurement, the value of parameter Q is generally set to the mean value of all total voice strip numbers of each speaker, weight
Function g (s, k, i) has evaluated the feature y after projectionsiThe degree of closeness on the classification boundary between local speaker, to determine
This feature ysiTo nonparametric class scatter matrix ΦbContribution degree.If feature ysiIf classification boundary, weighting function
G (s, k, i) is maximized 0.5, if far from classification boundary, the value of weighting function g (s, k, i) becomes smaller feature ysi therewith.
Feature in formula above refers to speech identity feature.
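The boundary-sensitive behaviour of the weighting function can be illustrated with a small sketch (names are illustrative): the ratio equals 0.5 when a feature is equidistant from the two speakers' neighbour sets, and decays as the feature moves away from the boundary.

```python
import numpy as np

def nda_weight(y, feats_s, feats_k, alpha=2.0, Q=3):
    """Sketch of the nonparametric weighting function g(s, k, i):
    compares the (alpha-powered) distance from y to its Q-th nearest
    neighbour in each of the two speakers' feature sets."""
    def dist_to_Qth_neighbour(y, feats):
        d = np.linalg.norm(feats - y, axis=1)
        return np.sort(d)[min(Q, len(d)) - 1]
    ds = dist_to_Qth_neighbour(y, feats_s) ** alpha
    dk = dist_to_Qth_neighbour(y, feats_k) ** alpha
    return min(ds, dk) / (ds + dk)
```

A feature exactly on the boundary gives equal distances and hence the maximum weight 0.5; a feature closer to one speaker's set gives a smaller weight, down-weighting its contribution to Φb.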
After the within-class covariance matrix and the nonparametric between-class covariance matrix are obtained, the embodiment of the present invention may replace the within-class transformation matrix in the scoring function of the PLDA classifier with the within-class covariance matrix, and replace the between-class transformation matrix with the nonparametric between-class covariance matrix. Specifically, for a given registered first speech identity feature y1 and a second speech identity feature y2, omitting the constant term, the score of the PLDA classifier (the score reflects the accuracy of the PLDA classifier) may be calculated as the following formula:
score(y1,y2)=(y1-μ)TΦw(y1-μ)+2(y1-μ)TΦb(y2-μ)+(y2-μ)TΦw(y2-μ)
where μ is the population mean, i.e., the mean of the F-vector training set.
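The scoring formula above translates directly into code; this is a minimal sketch with hypothetical names:

```python
import numpy as np

def plda_score(y1, y2, phi_w, phi_b, mu):
    """PLDA score with the constant term omitted:
    (y1-mu)^T Phi_w (y1-mu) + 2 (y1-mu)^T Phi_b (y2-mu)
    + (y2-mu)^T Phi_w (y2-mu)."""
    a = y1 - mu
    b = y2 - mu
    return float(a @ phi_w @ a + 2.0 * a @ phi_b @ b + b @ phi_w @ b)
```

Note that when Φb is symmetric the score is symmetric in y1 and y2, as expected of a verification score between an enrolled feature and a test feature.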
After the classifier is obtained by training, the embodiment of the present invention may realize higher-precision speaker verification based on the classifier's recognition of the voice of the target speaker and the voices of other speakers. When recognizing a speaker, the embodiment of the present invention may extract the speech feature vector of the current speaker, use the speech identity feature extractor provided by the embodiment of the present invention to extract the corresponding speech identity feature based on the speech feature vector of the current speaker, input this speech identity feature into the classifier trained for the target speaker, and recognize from the output result of the classifier whether the current speaker is the target speaker, thereby realizing the recognition of the current speaker.
Optionally, the simplified process of training the extractor and the classifier in the embodiment of the present invention may be as shown in Fig. 12: the I-vector corresponding to the training voice serves as the supervision information of the DNN model; a mapping from the speech feature vector to the I-vector feature space is established, the I-vector is extracted, and the DNN model is trained with the I-vector as the target. Subsequently, in order to obtain more compact speaker information, the identity factor hidden variable is determined in the SNR-invariant PLDA modeling process, and the DNN model is fine-tuned again with the supervision information based on the identity factor hidden variable, obtaining the final F-vector extractor. The F-vector extractor is then used to extract the F-vector of the voice, and the PLDA classifier for speaker identity recognition is realized based on the F-vector.
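The two-stage flow of Fig. 12 (train toward the I-vector, then fine-tune the same weights toward the posterior mean of the identity factor hidden variable) can be condensed into a toy sketch with a single linear layer standing in for the DNN; all names are illustrative, and the assumption that both targets share the same dimensionality is made purely for the sketch:

```python
import numpy as np

def two_stage_extractor(X_train, ivectors, X_detect, posterior_means,
                        lr=0.1, steps=500):
    """Stage 1: regress spliced training features to I-vectors.
    Stage 2: fine-tune the same weights toward the posterior means
    of the identity factor hidden variable (gradient descent on MSE)."""
    W = np.zeros((X_train.shape[1], ivectors.shape[1]))
    for X, target in ((X_train, ivectors), (X_detect, posterior_means)):
        for _ in range(steps):
            grad = 2.0 * X.T @ (X @ W - target) / len(X)   # d(MSE)/dW
            W -= lr * grad
    return W   # the "F-vector extractor": F-vector = x @ W
```

The point of the sketch is the reuse of the stage-1 weights as the starting point of stage 2, which is the fine-tuning step that turns the first model into the F-vector extractor.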
The speech identity feature extractor training apparatus provided by the embodiment of the present invention is introduced below. The speech identity feature extractor training apparatus described below may be regarded as the program modules that an electronic device (the electronic device may take the form of a server, a terminal, etc.) needs to set up in order to realize the speech identity feature extractor training method provided by the embodiment of the present invention. The speech identity feature extractor training apparatus described below and the speech identity feature extractor training method described above may be cross-referenced with each other.
Fig. 13 is a structural block diagram of the speech identity feature extractor training apparatus provided by the embodiment of the present invention. Referring to Fig. 13, the apparatus may include:
a first speech feature vector extraction module 100, configured to extract the speech feature vector of the training voice;
an identity factor determining module 110, configured to determine the I-vector corresponding to the training voice according to the speech feature vector of the training voice;
a first training module 120, configured to use the I-vector as the first target output of the neural network model and adjust the weights of the neural network model to obtain the first neural network model;
a first result determining module 130, configured to obtain the speech feature vector of the target detection voice and determine the output result of the first neural network model for the speech feature vector of the target detection voice;
a hidden variable determining module 140, configured to determine the identity factor hidden variable according to the output result;
a second training module 150, configured to estimate the posterior mean of the identity factor hidden variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain the speech identity feature extractor.
Optionally, the first training module 120 being configured to use the I-vector as the first target output of the neural network model and adjust the weights of the neural network model to obtain the first neural network model specifically includes:
determining the input speech feature vector according to the speech feature vector of the training voice;
taking the input speech feature vector as the input of the neural network model and the I-vector as the first target output of the neural network model, with the mean square error between each output of the neural network model and the first target output as the loss function, adjusting the weights of the neural network model to obtain the first neural network model.
Optionally, the first training module 120 being configured to determine the input speech feature vector according to the speech feature vector of the training voice specifically includes:
splicing the speech feature vectors of a set number of adjacent frames of the training voice to obtain the input speech feature vector.
Optionally, Fig. 14 shows another structural block diagram of the speech identity feature extractor training apparatus provided by the embodiment of the present invention. In combination with Fig. 13 and Fig. 14, the apparatus may also include:
a model initialization module 160, configured to initialize the neural network model with a layer-wise initialization method.
Optionally, the model initialization module 160 may initialize the neural network model with the layer-wise initialization method before the weights of the neural network model are adjusted; correspondingly, the first training module 120 may carry out its function on the basis of the initialized neural network model.
Optionally, the hidden variable determining module 140 being configured to determine the identity factor hidden variable according to the output result specifically includes:
determining the mean of the output result, training the signal-to-noise-ratio-invariant (SNR-invariant) PLDA model with this mean, and calculating the identity factor hidden variable in the training process.
Optionally, the hidden variable determining module 140 being configured to calculate the identity factor hidden variable in the training process specifically includes:
calculating the identity factor hidden variable $h_i$ according to the formula $V_{si} = m + R h_i + U g_b + \epsilon_{si}$;
where $V_{si}$ denotes the mean of the output results of the first neural network model for the speech feature vector of the i-th voice of the s-th speaker, b denotes the signal-to-noise-ratio interval corresponding to the target detection voice, m denotes the mean, R denotes the speaker information space, U denotes the signal-to-noise-ratio space, $g_b$ denotes the noise-specific factor, and $\epsilon_{si}$ denotes the residual term.
Optionally, the second training module 150, configured to take the posterior mean as the second target output of the first neural network model, adjust the weights of the first neural network model, and obtain the speech identity feature extractor, is specifically configured to:
splice the speech feature vectors of a set number of adjacent frames of the target detection speech as the input of the first neural network model, take the posterior mean of the identity factor hidden variable as the second target output of the first neural network model, take the mean squared error between each output of the first neural network model and the second target output as the loss function, and adjust the first neural network model to obtain the speech identity feature extractor.
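A minimal numpy sketch of this second training stage: a spliced input vector, the posterior mean as the second target output, and gradient steps on the mean-squared-error loss. A single linear layer stands in for the first neural network model, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Spliced input: 9 adjacent 20-dim frames -> one 180-dim vector (sizes illustrative)
x = rng.normal(size=180)
h_post = rng.normal(size=3)   # posterior mean of the identity factor hidden variable

# One linear output layer stands in for the (re-targeted) network
W = rng.normal(size=(3, 180)) * 0.01
b = np.zeros(3)

lr = 0.01
for _ in range(500):
    y = W @ x + b             # network output
    err = y - h_post          # residual against the second target output
    # Gradient of the MSE loss 0.5 * ||y - h_post||^2 w.r.t. W and b
    W -= lr * np.outer(err, x) / x.size
    b -= lr * err

print(float(np.mean((W @ x + b - h_post) ** 2)))
```

After training, the network's output for spliced frames is read out as the speech identity feature.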
Optionally, the identity factor determining module 110, configured to determine the I-vector corresponding to the training speech according to the speech feature vectors of the training speech, is specifically configured to:
determine sufficient statistics from the speech feature vector of each frame of the training speech based on a GMM model;
determine the total variability space matrix according to the sufficient statistics; and determine the I-vector corresponding to the training speech according to the total variability space matrix.
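The three steps above (sufficient statistics → total variability matrix → I-vector) can be sketched in numpy as follows. The UBM posteriors, the matrix T, and all dimensions are illustrative stand-ins; in practice T is trained by EM and the posteriors come from a trained GMM:

```python
import numpy as np

rng = np.random.default_rng(2)

C, D, K = 4, 5, 3   # GMM components, feature dim, i-vector dim (all illustrative)
frames = rng.normal(size=(50, D))          # speech feature vectors, one per frame

# Stand-in UBM: component means and frame-level component posteriors
means = rng.normal(size=(C, D))
post = rng.dirichlet(np.ones(C), size=50)  # (50, C) responsibilities

# Sufficient statistics: zeroth order N_c and centered first order F_c
N = post.sum(axis=0)                       # (C,)
F = post.T @ frames - N[:, None] * means   # (C, D)

# Total variability matrix T (would normally be trained by EM)
T = rng.normal(size=(C * D, K)) * 0.1
Sigma_inv = np.eye(C * D)                  # identity residual precision for simplicity

# I-vector = posterior mean:  w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
NT = np.repeat(N, D)[:, None] * T          # zeroth-order stats expanded per dimension
L = np.eye(K) + T.T @ Sigma_inv @ NT
w = np.linalg.solve(L, T.T @ Sigma_inv @ F.reshape(-1))

print(w.shape)  # (3,)
```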
An embodiment of the present invention further provides an electronic device. The speech identity feature extractor training apparatus described above may be loaded into the electronic device in the form of a program. Figure 15 shows the hardware structure of the electronic device; referring to Figure 15, the electronic device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In this embodiment of the present invention, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4.
Optionally, the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 3 may include high-speed RAM, and may further include non-volatile memory, for example at least one disk memory.
The memory stores an executable program that can be called and executed by the processor, and the program can be used to:
extract the speech feature vectors of training speech;
determine the I-vector corresponding to the training speech according to the speech feature vectors of the training speech;
take the I-vector as the first target output of a neural network model, adjust the weights of the neural network model, and obtain a first neural network model;
obtain the speech feature vectors of target detection speech, and determine the output results of the first neural network model for the speech feature vectors of the target detection speech;
determine the identity factor hidden variable according to the output results;
estimate the posterior mean of the identity factor hidden variable, take the posterior mean as the second target output of the first neural network model, adjust the weights of the first neural network model, and obtain the speech identity feature extractor.
Optionally, for refinements and extensions of the functions of the program, refer to the corresponding descriptions above, for example the description of the speech identity feature extractor training method.
An embodiment of the present invention further provides a classifier training apparatus. The classifier training apparatus described below may be regarded as the program modules that an electronic device (for example a server or a terminal) needs in order to implement the classifier training method provided by the embodiments of the present invention. The classifier training apparatus described below and the classifier training method described above may be cross-referenced.
Figure 16 is a structural block diagram of the classifier training apparatus provided by an embodiment of the present invention. Referring to Figure 16, the classifier training apparatus may include:
A target detection speech obtaining module 200, configured to obtain the target detection speech of a target speaker;
A second speech feature vector extraction module 210, configured to extract the speech feature vectors of the target detection speech;
A speech identity feature extraction module 220, configured to call a pre-trained speech identity feature extractor and input the speech feature vectors of the target detection speech into the speech identity feature extractor to obtain the corresponding speech identity features, wherein the speech identity feature extractor is trained with the identity factor hidden variable as its target output;
A training module 230, configured to train a classifier according to the speech identity features.
Optionally, the training module 230, configured to train the classifier according to the speech identity features, is specifically configured to:
determine the mean of the speech identity features; apply within-class covariance normalization and L2-norm normalization to the mean of the speech identity features to obtain processed features; and train the classifier with the processed features.
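A minimal numpy sketch of these two normalization steps, under the assumption that "within-class covariance normalization" is the usual WCCN-style whitening by the Cholesky factor of the inverse within-class covariance; the toy features and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy speech identity features: 10 speakers x 5 utterances x 8 dims (illustrative)
feats = rng.normal(size=(10, 5, 8))
spk_means = feats.mean(axis=1, keepdims=True)

# Within-class covariance: average scatter of features around their speaker means
centered = (feats - spk_means).reshape(-1, 8)
W = centered.T @ centered / centered.shape[0]

# WCCN-style whitening via the Cholesky factor of W^-1
B = np.linalg.cholesky(np.linalg.inv(W))

def normalize(x):
    y = B.T @ x                    # within-class covariance normalization
    return y / np.linalg.norm(y)   # L2-norm (length) normalization

z = normalize(feats[0, 0])
print(round(float(np.linalg.norm(z)), 6))  # 1.0
```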
Optionally, in this embodiment of the present invention, the classifier may be based on a PLDA model, and accordingly the classifier may be a PLDA classifier. To improve the precision of the classifier, the within-class transformation matrix in the scoring function of the PLDA classifier may be replaced by the within-class covariance matrix, and the between-class transformation matrix may be replaced by the nonparametric between-class covariance matrix.
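For illustration, the two replacement matrices can be estimated from labeled embeddings roughly as below. The "nonparametric between-class covariance" is sketched in a simplified k=1, unweighted form; the disclosure does not specify these details, so treat the construction as an assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

# Labeled toy embeddings: 3 speakers x 20 samples x 4 dims (illustrative)
X = rng.normal(size=(3, 20, 4)) + np.arange(3)[:, None, None]
labels = np.repeat(np.arange(3), 20)
flat = X.reshape(-1, 4)

# Within-class covariance matrix: scatter around each speaker's mean
mu = X.mean(axis=1, keepdims=True)
centered = (X - mu).reshape(-1, 4)
Sw = centered.T @ centered / centered.shape[0]

# Simplified nonparametric between-class covariance: scatter of each sample
# against its nearest neighbor from a *different* speaker (k=1, unweighted)
diffs = []
for i, x in enumerate(flat):
    other = flat[labels != labels[i]]
    nn = other[np.argmin(np.linalg.norm(other - x, axis=1))]
    diffs.append(x - nn)
diffs = np.asarray(diffs)
Sb = diffs.T @ diffs / diffs.shape[0]

print(Sw.shape, Sb.shape)  # (4, 4) (4, 4)
```

These two matrices would then stand in for the within-class and between-class transformation matrices in the PLDA scoring function.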
Optionally, the classifier training apparatus may be loaded into an electronic device in the form of a program. The structure of the electronic device may be as shown in Figure 15 and includes at least one memory and at least one processor; the memory stores an executable program, and the program may specifically be used to:
obtain the target detection speech of a target speaker;
extract the speech feature vectors of the target detection speech;
call a pre-trained speech identity feature extractor, and input the speech feature vectors of the target detection speech into the speech identity feature extractor to obtain the corresponding speech identity features, wherein the speech identity feature extractor is trained with the identity factor hidden variable as its target output;
train a classifier according to the speech identity features.
The embodiments of the present invention enable the training of a novel speech identity feature extractor. The speech identity feature extractor obtained by training can extract a novel speech identity feature with high reliability; a higher-precision classifier can then be trained on that novel speech identity feature, and the classifier obtained by training can improve the accuracy of speaker identity recognition.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may refer to one another. Since the apparatuses disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is relatively brief, and for relevant details refer to the description of the methods.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (15)
1. A speech identity feature extractor training method, comprising:
extracting speech feature vectors of training speech;
determining, according to the speech feature vectors of the training speech, an identity factor (I-vector) corresponding to the training speech;
taking the I-vector as a first target output of a neural network model, and adjusting weights of the neural network model to obtain a first neural network model;
obtaining speech feature vectors of target detection speech, and determining output results of the first neural network model for the speech feature vectors of the target detection speech;
determining an identity factor hidden variable according to the output results;
estimating a posterior mean of the identity factor hidden variable, taking the posterior mean as a second target output of the first neural network model, and adjusting the weights of the first neural network model to obtain a speech identity feature extractor.
2. The speech identity feature extractor training method according to claim 1, wherein the taking the I-vector as a first target output of a neural network model and adjusting the weights of the neural network model to obtain a first neural network model comprises:
determining input speech feature vectors according to the speech feature vectors of the training speech;
taking the input speech feature vectors as the input of the neural network model and the I-vector as the first target output of the neural network model, taking the mean squared error between each output of the neural network model and the first target output as a loss function, and adjusting the weights of the neural network model to obtain the first neural network model.
3. The speech identity feature extractor training method according to claim 2, wherein the determining input speech feature vectors according to the speech feature vectors of the training speech comprises:
splicing the speech feature vectors of a set number of adjacent frames of the training speech to obtain the input speech feature vectors.
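The frame splicing of claim 3 can be sketched as follows; the context width, feature dimension, and edge padding below are illustrative assumptions not specified by the claim:

```python
import numpy as np

rng = np.random.default_rng(5)

frames = rng.normal(size=(100, 20))   # 100 frames of 20-dim speech features (illustrative)

def splice(frames, context=4):
    """Concatenate each frame with `context` adjacent frames on each side."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i : i + 2 * context + 1].reshape(-1)
                     for i in range(len(frames))])

spliced = splice(frames)
print(spliced.shape)  # (100, 180): each 9-frame window flattened to one input vector
```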
4. The speech identity feature extractor training method according to any one of claims 1 to 3, further comprising, before the adjusting of the weights of the neural network model:
initializing the neural network model with a layer-wise initialization method.
5. The speech identity feature extractor training method according to claim 1, wherein the determining an identity factor hidden variable according to the output results comprises:
determining the mean of the output results, training a signal-to-noise-ratio-invariant probabilistic linear discriminant analysis (SNR-invariant PLDA) model with the mean, and obtaining the identity factor hidden variable in the course of that training.
6. The speech identity feature extractor training method according to claim 5, wherein the obtaining of the identity factor hidden variable in the course of training comprises:
calculating the identity factor hidden variable h_i according to the formula V_si = m + R·h_i + U·g_b + ε_si;
wherein V_si denotes the mean of the output results of the first neural network model for the speech feature vectors of the i-th utterance of the s-th speaker, b denotes the SNR interval corresponding to the target detection speech, m denotes the global mean, R denotes the speaker information space, U denotes the SNR space, g_b denotes the noise-specific factor, and ε_si denotes the residual term.
7. The speech identity feature extractor training method according to claim 1, wherein the taking the posterior mean as the second target output of the first neural network model and adjusting the weights of the first neural network model to obtain a speech identity feature extractor comprises:
splicing the speech feature vectors of a set number of adjacent frames of the target detection speech as the input of the first neural network model, taking the posterior mean of the identity factor hidden variable as the second target output of the first neural network model, taking the mean squared error between each output of the first neural network model and the second target output as a loss function, and adjusting the first neural network model to obtain the speech identity feature extractor.
8. The speech identity feature extractor training method according to claim 1, wherein the determining, according to the speech feature vectors of the training speech, the I-vector corresponding to the training speech comprises:
determining sufficient statistics from the speech feature vector of each frame of the training speech based on a Gaussian mixture model (GMM);
determining a total variability space matrix according to the sufficient statistics;
determining the I-vector corresponding to the training speech according to the total variability space matrix.
9. A classifier training method, comprising:
obtaining target detection speech of a target speaker;
extracting speech feature vectors of the target detection speech;
calling a pre-trained speech identity feature extractor, and inputting the speech feature vectors of the target detection speech into the speech identity feature extractor to obtain corresponding speech identity features, wherein the speech identity feature extractor is trained with an identity factor hidden variable as its target output;
training a classifier according to the speech identity features.
10. The classifier training method according to claim 9, wherein the training a classifier according to the speech identity features comprises:
determining the mean of the speech identity features;
applying within-class covariance normalization and L2-norm normalization to the mean of the speech identity features to obtain processed features, and training the classifier with the processed features.
11. The classifier training method according to claim 9 or 10, wherein the classifier is based on a probabilistic linear discriminant analysis (PLDA) model and is a PLDA classifier; in the scoring function of the PLDA classifier, the within-class transformation matrix is replaced by the within-class covariance matrix, and the between-class transformation matrix is replaced by the nonparametric between-class covariance matrix.
12. A speech identity feature extractor training apparatus, comprising:
a first speech feature vector extraction module, configured to extract speech feature vectors of training speech;
an identity factor determining module, configured to determine, according to the speech feature vectors of the training speech, an identity factor (I-vector) corresponding to the training speech;
a first training module, configured to take the I-vector as a first target output of a neural network model and adjust weights of the neural network model to obtain a first neural network model;
a first result determining module, configured to obtain speech feature vectors of target detection speech and determine output results of the first neural network model for the speech feature vectors of the target detection speech;
a hidden variable determining module, configured to determine an identity factor hidden variable according to the output results;
a second training module, configured to estimate a posterior mean of the identity factor hidden variable, take the posterior mean as a second target output of the first neural network model, and adjust the weights of the first neural network model to obtain a speech identity feature extractor.
13. An electronic device, comprising: at least one memory and at least one processor; the memory stores an executable program, and the program is used to:
extract speech feature vectors of training speech;
determine, according to the speech feature vectors of the training speech, an identity factor (I-vector) corresponding to the training speech;
take the I-vector as a first target output of a neural network model, and adjust weights of the neural network model to obtain a first neural network model;
obtain speech feature vectors of target detection speech, and determine output results of the first neural network model for the speech feature vectors of the target detection speech;
determine an identity factor hidden variable according to the output results;
estimate a posterior mean of the identity factor hidden variable, take the posterior mean as a second target output of the first neural network model, and adjust the weights of the first neural network model to obtain a speech identity feature extractor.
14. A classifier training apparatus, comprising:
a target detection speech obtaining module, configured to obtain target detection speech of a target speaker;
a second speech feature vector extraction module, configured to extract speech feature vectors of the target detection speech;
a speech identity feature extraction module, configured to call a pre-trained speech identity feature extractor and input the speech feature vectors of the target detection speech into the speech identity feature extractor to obtain corresponding speech identity features, wherein the speech identity feature extractor is trained with an identity factor hidden variable as its target output;
a training module, configured to train a classifier according to the speech identity features.
15. An electronic device, comprising: at least one memory and at least one processor; the memory stores an executable program, and the program is used to:
obtain target detection speech of a target speaker;
extract speech feature vectors of the target detection speech;
call a pre-trained speech identity feature extractor, and input the speech feature vectors of the target detection speech into the speech identity feature extractor to obtain corresponding speech identity features, wherein the speech identity feature extractor is trained with an identity factor hidden variable as its target output;
train a classifier according to the speech identity features.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710910880.XA CN109584884B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
CN201910741216.6A CN110310647B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
PCT/CN2018/107385 WO2019062721A1 (en) | 2017-09-29 | 2018-09-25 | Training method for voice identity feature extractor and classifier and related devices |
US16/654,383 US11335352B2 (en) | 2017-09-29 | 2019-10-16 | Voice identity feature extractor and classifier training |
US17/720,876 US20220238117A1 (en) | 2017-09-29 | 2022-04-14 | Voice identity feature extractor and classifier training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710910880.XA CN109584884B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741216.6A Division CN110310647B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109584884A true CN109584884A (en) | 2019-04-05 |
CN109584884B CN109584884B (en) | 2022-09-13 |
Family
ID=65900669
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710910880.XA Active CN109584884B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
CN201910741216.6A Active CN110310647B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741216.6A Active CN110310647B (en) | 2017-09-29 | 2017-09-29 | Voice identity feature extractor, classifier training method and related equipment |
Country Status (3)
Country | Link |
---|---|
US (2) | US11335352B2 (en) |
CN (2) | CN109584884B (en) |
WO (1) | WO2019062721A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109920435A (en) * | 2019-04-09 | 2019-06-21 | 厦门快商通信息咨询有限公司 | A kind of method for recognizing sound-groove and voice print identification device |
CN110807333A (en) * | 2019-10-30 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Semantic processing method and device of semantic understanding model and storage medium |
CN113362829A (en) * | 2021-06-04 | 2021-09-07 | 思必驰科技股份有限公司 | Speaker verification method, electronic device and storage medium |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106169295B (en) * | 2016-07-15 | 2019-03-01 | 腾讯科技(深圳)有限公司 | Identity vector generation method and device |
CN107945806B (en) * | 2017-11-10 | 2022-03-08 | 北京小米移动软件有限公司 | User identification method and device based on sound characteristics |
US20190244062A1 (en) * | 2018-02-04 | 2019-08-08 | KaiKuTek Inc. | Gesture recognition method, gesture recognition system, and performing device therefore |
DE112018006885B4 (en) * | 2018-02-20 | 2021-11-04 | Mitsubishi Electric Corporation | TRAINING DEVICE, LANGUAGE ACTIVITY DETECTOR AND METHOD FOR DETECTING LANGUAGE ACTIVITY |
CN111583907B (en) * | 2020-04-15 | 2023-08-15 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN111524525B (en) * | 2020-04-28 | 2023-06-16 | 平安科技(深圳)有限公司 | Voiceprint recognition method, device, equipment and storage medium of original voice |
CN112001215B (en) * | 2020-05-25 | 2023-11-24 | 天津大学 | Text irrelevant speaker identity recognition method based on three-dimensional lip movement |
CN112259078A (en) * | 2020-10-15 | 2021-01-22 | 上海依图网络科技有限公司 | Method and device for training audio recognition model and recognizing abnormal audio |
CN112164404A (en) * | 2020-10-28 | 2021-01-01 | 广西电网有限责任公司贺州供电局 | Remote identity authentication method and system based on voiceprint recognition technology |
CN112466298B (en) * | 2020-11-24 | 2023-08-11 | 杭州网易智企科技有限公司 | Voice detection method, device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
US20160293167A1 (en) * | 2013-10-10 | 2016-10-06 | Google Inc. | Speaker recognition using neural networks |
CN106971713A (en) * | 2017-01-18 | 2017-07-21 | 清华大学 | Speaker's labeling method and system based on density peaks cluster and variation Bayes |
Family Cites Families (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI223791B (en) * | 2003-04-14 | 2004-11-11 | Ind Tech Res Inst | Method and system for utterance verification |
JP2008191444A (en) | 2007-02-06 | 2008-08-21 | Nec Electronics Corp | Display driver ic |
CN101241699B (en) * | 2008-03-14 | 2012-07-18 | 北京交通大学 | A speaker identification method for remote Chinese teaching |
CN102820033B (en) * | 2012-08-17 | 2013-12-04 | 南京大学 | Voiceprint identification method |
US9406298B2 (en) * | 2013-02-07 | 2016-08-02 | Nuance Communications, Inc. | Method and apparatus for efficient i-vector extraction |
US10438581B2 (en) * | 2013-07-31 | 2019-10-08 | Google Llc | Speech recognition using neural networks |
CN103391201B (en) * | 2013-08-05 | 2016-07-13 | 公安部第三研究所 | The system and method for smart card identity checking is realized based on Application on Voiceprint Recognition |
CN104765996B (en) * | 2014-01-06 | 2018-04-27 | 讯飞智元信息科技有限公司 | Voiceprint password authentication method and system |
CN105261367B (en) * | 2014-07-14 | 2019-03-15 | 中国科学院声学研究所 | A kind of method for distinguishing speek person |
US9373330B2 (en) * | 2014-08-07 | 2016-06-21 | Nuance Communications, Inc. | Fast speaker recognition scoring using I-vector posteriors and probabilistic linear discriminant analysis |
CN105096121B (en) * | 2015-06-25 | 2017-07-25 | 百度在线网络技术(北京)有限公司 | voiceprint authentication method and device |
CN105139856B (en) * | 2015-09-02 | 2019-07-09 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Probability linear discriminant method for distinguishing speek person based on the regular covariance of priori knowledge |
CN105895078A (en) * | 2015-11-26 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Speech recognition method used for dynamically selecting speech model and device |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN105845140A (en) * | 2016-03-23 | 2016-08-10 | 广州势必可赢网络科技有限公司 | Speaker confirmation method and speaker confirmation device used in short voice condition |
CN106098068B (en) * | 2016-06-12 | 2019-07-16 | 腾讯科技(深圳)有限公司 | A kind of method for recognizing sound-groove and device |
CN106169295B (en) * | 2016-07-15 | 2019-03-01 | 腾讯科技(深圳)有限公司 | Identity vector generation method and device |
CN107785015A (en) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method and device |
CN107610707B (en) * | 2016-12-15 | 2018-08-31 | 平安科技(深圳)有限公司 | A kind of method for recognizing sound-groove and device |
CN106847292B (en) * | 2017-02-16 | 2018-06-19 | 平安科技(深圳)有限公司 | Method for recognizing sound-groove and device |
CN107039036B (en) * | 2017-02-17 | 2020-06-16 | 南京邮电大学 | High-quality speaker recognition method based on automatic coding depth confidence network |
CN107146601B (en) * | 2017-04-07 | 2020-07-24 | 南京邮电大学 | Rear-end i-vector enhancement method for speaker recognition system |
US10347244B2 (en) * | 2017-04-21 | 2019-07-09 | Go-Vivace Inc. | Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response |
CN107633842B (en) * | 2017-06-12 | 2018-08-31 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
WO2019053898A1 (en) * | 2017-09-15 | 2019-03-21 | Nec Corporation | Pattern recognition apparatus, pattern recognition method, and storage medium |
2017
- 2017-09-29 CN CN201710910880.XA patent/CN109584884B/en active Active
- 2017-09-29 CN CN201910741216.6A patent/CN110310647B/en active Active
2018
- 2018-09-25 WO PCT/CN2018/107385 patent/WO2019062721A1/en active Application Filing
2019
- 2019-10-16 US US16/654,383 patent/US11335352B2/en active Active
2022
- 2022-04-14 US US17/720,876 patent/US20220238117A1/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109920435A (en) * | 2019-04-09 | 2019-06-21 | 厦门快商通信息咨询有限公司 | A kind of method for recognizing sound-groove and voice print identification device |
CN110807333A (en) * | 2019-10-30 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Semantic processing method and device of semantic understanding model and storage medium |
CN110807333B (en) * | 2019-10-30 | 2024-02-06 | 腾讯科技(深圳)有限公司 | Semantic processing method, device and storage medium of semantic understanding model |
CN113362829A (en) * | 2021-06-04 | 2021-09-07 | 思必驰科技股份有限公司 | Speaker verification method, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110310647A (en) | 2019-10-08 |
US11335352B2 (en) | 2022-05-17 |
CN110310647B (en) | 2022-02-25 |
US20220238117A1 (en) | 2022-07-28 |
WO2019062721A1 (en) | 2019-04-04 |
CN109584884B (en) | 2022-09-13 |
US20200043504A1 (en) | 2020-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109584884A (en) | A kind of speech identity feature extractor, classifier training method and relevant device | |
CN110189769B (en) | Abnormal sound detection method based on combination of multiple convolutional neural network models | |
CN104835498B (en) | Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter | |
CN112259106B (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
CN104900235B (en) | Method for recognizing sound-groove based on pitch period composite character parameter | |
WO2020181824A1 (en) | Voiceprint recognition method, apparatus and device, and computer-readable storage medium | |
US6253179B1 (en) | Method and apparatus for multi-environment speaker verification | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN102324232A (en) | Method for recognizing sound-groove and system based on gauss hybrid models | |
EP0822539A2 (en) | Two-staged cohort selection for speaker verification system | |
CN102486922B (en) | Speaker recognition method, device and system | |
CN108109613A (en) | For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment | |
CN108986824A (en) | A kind of voice playback detection method | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN109192224A (en) | A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing | |
CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN108091326A (en) | A kind of method for recognizing sound-groove and system based on linear regression | |
CN109448732A (en) | A kind of digit string processing method and processing device | |
CN110111798A (en) | A kind of method and terminal identifying speaker | |
CN107545898B (en) | Processing method and device for distinguishing speaker voice | |
Fasounaki et al. | CNN-based Text-independent automatic speaker identification using short utterances | |
CN100570712C (en) | Based on anchor model space projection ordinal number quick method for identifying speaker relatively | |
Weng et al. | The sysu system for the interspeech 2015 automatic speaker verification spoofing and countermeasures challenge | |
Herrera-Camacho et al. | Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||