CN109584884A - Speech identity feature extractor and classifier training methods, and related devices - Google Patents

Speech identity feature extractor and classifier training methods, and related devices

Info

Publication number
CN109584884A
Authority
CN
China
Prior art keywords
speech
identity
voice
network model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710910880.XA
Other languages
Chinese (zh)
Other versions
CN109584884B (en)
Inventor
李娜
王珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710910880.XA
Priority to CN201910741216.6A
Priority to PCT/CN2018/107385 (published as WO2019062721A1)
Publication of CN109584884A
Priority to US16/654,383 (granted as US11335352B2)
Priority to US17/720,876 (published as US20220238117A1)
Application granted
Publication of CN109584884B
Legal status: Active

Classifications

    • G06N3/04: Neural network architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N7/01: Probabilistic graphical models, e.g. probabilistic networks
    • G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18: Artificial neural networks; connectionist approaches


Abstract

An embodiment of the present invention provides a speech identity feature extractor training method, a classifier training method, and related devices. The extractor training method includes: extracting the speech feature vector of training speech; determining the corresponding I-vector according to the speech feature vector of the training speech; using the I-vector as the first target output of a neural network model and adjusting the weights of the neural network model to obtain a first neural network model; obtaining the speech feature vector of target detection speech and determining the output of the first neural network model for that speech feature vector; determining an identity-factor latent variable according to the output; and estimating the posterior mean of the identity-factor latent variable, using the posterior mean as the second target output of the first neural network model, and adjusting the weights of the first neural network model to obtain the speech identity feature extractor. Embodiments of the present invention can thus train a novel speech identity feature extractor, making the extraction of novel, highly reliable speech identity features possible.

Description

Speech identity feature extractor and classifier training methods, and related devices
Technical field
The present invention relates to the field of speech technology, and in particular to a speech identity feature extractor training method, a classifier training method, and related devices.
Background art
Because speech is easy to acquire, easy to store, and difficult to imitate, it is used in more and more identity recognition scenarios, solving many information security problems in settings where data security matters. Voice-based speaker identity recognition can be divided into two classes: speaker identification and speaker verification. Speaker identification judges, based on test speech uttered by a speaker, whether the speaker belongs to a set of registered speakers; it is a one-to-many recognition problem. Speaker verification judges, based on test speech uttered by a speaker, whether the speaker is a specific registered target speaker; it is a one-to-one confirmation problem.
When performing voice-based speaker identity recognition, a speech identity feature expressing the speaker's identity information must be extracted from the speaker's speech, and that feature is then processed by a pre-trained classifier to recognize the speaker. At present, the I-vector (identity vector) is mainly used as the speech identity feature. Although the I-vector reflects acoustic differences between speakers and is the speech identity feature commonly used for speaker identity recognition today, the inventors found that the reliability of the I-vector rests on fairly strict requirements on the speech: when the speech duration is short or the speech is otherwise less than ideal, the reliability of the I-vector drops greatly.
Therefore, how to provide a novel speech identity feature extractor that extracts a novel speech identity feature different from the I-vector, so as to improve the reliability of the speech identity feature, has become a problem that those skilled in the art need to consider.
Summary of the invention
In view of this, embodiments of the present invention provide a speech identity feature extractor training method, a classifier training method, and related devices, so as to provide a novel speech identity feature extractor and realize the extraction of novel, highly reliable speech identity features, and further to perform speaker identity recognition based on the novel speech identity feature and improve the accuracy of speaker identity recognition.
To achieve the above objects, embodiments of the present invention provide the following technical solutions.
A speech identity feature extractor training method comprises:
extracting the speech feature vector of training speech;
determining the I-vector corresponding to the training speech according to the speech feature vector of the training speech;
using the I-vector as the first target output of a neural network model, and adjusting the weights of the neural network model to obtain a first neural network model;
obtaining the speech feature vector of target detection speech, and determining the output of the first neural network model for the speech feature vector of the target detection speech;
determining an identity-factor latent variable according to the output;
estimating the posterior mean of the identity-factor latent variable, using the posterior mean as the second target output of the first neural network model, and adjusting the weights of the first neural network model to obtain the speech identity feature extractor.
An embodiment of the present invention further provides a classifier training method, comprising:
obtaining the target detection speech of a target speaker;
extracting the speech feature vector of the target detection speech;
calling a pre-trained speech identity feature extractor, and inputting the speech feature vector of the target detection speech into the speech identity feature extractor to obtain the corresponding speech identity feature, the speech identity feature extractor having been trained with an identity-factor latent variable as its target output;
training a classifier according to the speech identity feature.
An embodiment of the present invention further provides a speech identity feature extractor training device, comprising:
a first speech feature vector extraction module, configured to extract the speech feature vector of training speech;
an identity factor determining module, configured to determine the I-vector corresponding to the training speech according to the speech feature vector of the training speech;
a first training module, configured to use the I-vector as the first target output of a neural network model and adjust the weights of the neural network model to obtain a first neural network model;
a first result determining module, configured to obtain the speech feature vector of target detection speech and determine the output of the first neural network model for the speech feature vector of the target detection speech;
a latent variable determining module, configured to determine an identity-factor latent variable according to the output;
a second training module, configured to estimate the posterior mean of the identity-factor latent variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain the speech identity feature extractor.
An embodiment of the present invention further provides an electronic device, comprising a memory and at least one processor, the memory storing an executable program, the program being used to:
extract the speech feature vector of training speech;
determine the I-vector corresponding to the training speech according to the speech feature vector of the training speech;
use the I-vector as the first target output of a neural network model, and adjust the weights of the neural network model to obtain a first neural network model;
obtain the speech feature vector of target detection speech, and determine the output of the first neural network model for the speech feature vector of the target detection speech;
determine an identity-factor latent variable according to the output;
estimate the posterior mean of the identity-factor latent variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain the speech identity feature extractor.
An embodiment of the present invention further provides a classifier training device, comprising:
a target detection speech obtaining module, configured to obtain the target detection speech of a target speaker;
a second speech feature vector extraction module, configured to extract the speech feature vector of the target detection speech;
a speech identity feature extraction module, configured to call a pre-trained speech identity feature extractor and input the speech feature vector of the target detection speech into the speech identity feature extractor to obtain the corresponding speech identity feature, the speech identity feature extractor having been trained with an identity-factor latent variable as its target output;
a training module, configured to train a classifier according to the speech identity feature.
An embodiment of the present invention further provides an electronic device, comprising a memory and at least one processor, the memory storing an executable program, the program being used to:
obtain the target detection speech of a target speaker;
extract the speech feature vector of the target detection speech;
call a pre-trained speech identity feature extractor, and input the speech feature vector of the target detection speech into the speech identity feature extractor to obtain the corresponding speech identity feature, the speech identity feature extractor having been trained with an identity-factor latent variable as its target output;
train a classifier according to the speech identity feature.
Based on the above technical solutions, the speech identity feature extractor training method provided by embodiments of the present invention includes: extracting the speech feature vector of training speech; determining the I-vector corresponding to the training speech according to that speech feature vector; using the I-vector as the first target output of a neural network model and adjusting the weights of the neural network model to obtain a first neural network model; then obtaining the speech feature vector of target detection speech, determining the output of the first neural network model for that speech feature vector, and determining an identity-factor latent variable according to the output; and finally estimating the posterior mean of the identity-factor latent variable, using the posterior mean as the second target output of the neural network model, and adjusting the weights of the neural network model to obtain the speech identity feature extractor, thereby realizing the training of a novel speech identity feature extractor.

The speech identity feature extractor training method provided by embodiments of the present invention is based on a neural network model and takes as its target the posterior mean of the identity-factor latent variable, which contains more compact speaker information and has high reliability. The speech identity features extracted by the resulting extractor therefore have higher reliability, and the requirements on the speech can be relaxed. The training method provided by embodiments of the present invention can thus train a novel speech identity feature extractor, making the extraction of novel, highly reliable speech identity features possible.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a speech identity feature extractor training method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of preprocessing applied to training speech;
Fig. 3 is a flowchart of a method for determining the I-vector corresponding to training speech;
Fig. 4 is a schematic diagram of layer-wise initialization of a neural network model;
Fig. 5 is a flowchart of a method for training the first neural network model;
Fig. 6 is a schematic diagram of the process of training the first neural network model;
Fig. 7 is a flowchart of a method for training the speech identity feature extractor based on the first neural network model;
Fig. 8 is a schematic diagram of the process of training the speech identity feature extractor;
Fig. 9 is a schematic diagram of training the F-vector extractor on a layer-wise initialized neural network model;
Fig. 10 is a flowchart of a classifier training method provided by an embodiment of the present invention;
Fig. 11 is a flowchart of a method for training a classifier according to speech identity features;
Fig. 12 is a simplified schematic diagram of training the extractor and classifier according to an embodiment of the present invention;
Fig. 13 is a structural block diagram of a speech identity feature extractor training device provided by an embodiment of the present invention;
Fig. 14 is another structural block diagram of the speech identity feature extractor training device provided by an embodiment of the present invention;
Fig. 15 is a hardware block diagram of an electronic device;
Fig. 16 is a structural block diagram of a classifier training device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a speech identity feature extractor training method provided by an embodiment of the present invention. With this training method, the novel speech identity feature extractor provided by embodiments of the present invention can be trained; based on that extractor, a speech identity feature different from the I-vector and of higher reliability can be extracted from speech.
The method shown in Fig. 1 can be applied to an electronic device with data processing capability. The electronic device may be, for example, a network-side server or a user-side device such as a mobile phone or a PC (personal computer), depending on actual needs. In embodiments of the present invention, a program corresponding to the speech identity feature extractor training method can be loaded on the electronic device to execute the method provided by embodiments of the present invention.
Referring to Fig. 1, the speech identity feature extractor training method provided by an embodiment of the present invention may include:

Step S100: extract the speech feature vector of training speech.

Optionally, the training speech can be obtained from a preset training speech set. In embodiments of the present invention, multiple speech segments can be collected in advance and recorded in the training speech set, each collected speech segment being regarded as one training speech.

Optionally, the speech feature vector may be chosen as the MFCC (Mel-frequency cepstral coefficient) feature; extracting the MFCC features of the training speech realizes the extraction of its speech feature vector.

Optionally, in embodiments of the present invention the training speech can be preprocessed to extract its speech feature vector. As an optional implementation, referring to Fig. 2, the preprocessing may include, performed in sequence: voice activity detection (VAD), pre-emphasis, framing with Hamming windowing, FFT (fast Fourier transform), Mel filtering, Log (logarithm), DCT (discrete cosine transform), CMVN (cepstral mean and variance normalization), Δ (first-order difference), and ΔΔ (second-order difference) processing.

Optionally, the speech feature vector of the training speech consists of the speech feature vectors of its frames; the per-frame speech feature vectors together form the speech feature vector sequence of the training speech. For example, the speech feature vector sequence of the i-th training speech can be denoted $\{x_1^i, x_2^i, \ldots, x_T^i\}$, where $x_t^i$ denotes the speech feature vector of the t-th frame of the i-th training speech.
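For illustration only (this is a sketch, not the patented implementation), the Fig. 2 preprocessing chain could be approximated with librosa roughly as follows; the pre-emphasis coefficient, frame/hop sizes, and coefficient count are assumed values, and VAD is omitted for brevity:

```python
# Illustrative sketch of the Fig. 2 preprocessing chain using librosa.
# Assumed values: 0.97 pre-emphasis, 512-point FFT, 10 ms hop, 13 MFCCs.
import numpy as np
import librosa

def extract_feature_sequence(wav_path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])      # pre-emphasis
    # framing + Hamming window + FFT + Mel filtering + log + DCT are all
    # performed internally by librosa.feature.mfcc
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160, window="hamming")
    # CMVN: cepstral mean and variance normalization over the utterance
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) \
           / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    delta = librosa.feature.delta(mfcc, order=1)    # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order difference
    feats = np.vstack([mfcc, delta, delta2])        # (3 * n_mfcc, T)
    return feats.T                                  # one row per frame: x_t^i
```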
Step S110: determine the I-vector corresponding to the training speech according to the speech feature vector of the training speech.

After extracting the speech feature vector of the training speech, embodiments of the present invention can process it based on a GMM (Gaussian mixture model) to determine the I-vector (identity factor) corresponding to the training speech.

Because the reliability of the I-vector rests on fairly strict requirements on speech duration and the like, its reliability is low for short speech (speech whose duration falls below a duration threshold, e.g. 10 seconds). Therefore, after determining the I-vector, embodiments of the present invention do not use it directly as the speech identity feature for speaker identity recognition, but instead use it in the subsequent training of the novel speech identity feature extractor.
Step S120: use the I-vector as the first target output of a neural network model, and adjust the weights of the neural network model to obtain a first neural network model.

The speech identity feature extractor provided by embodiments of the present invention can be trained based on a neural network model, such as a DNN (deep neural network) model; neural network models of other forms, such as CNN (convolutional neural network) models, are of course not excluded.

Embodiments of the present invention can use the I-vector corresponding to the training speech as the first target output of the neural network model, and adjust the weights of the neural network model so that its output matches the first target output, obtaining the adjusted first neural network model. Optionally, during this process, the mean squared error between each output of the neural network model and the first target output can serve as the loss function supervising the weight adjustment, so that the output of the neural network model eventually tends to the first target output (i.e., the I-vector corresponding to the training speech), realizing the acquisition of the first neural network model.

Optionally, the input used when adjusting the weights of the neural network model can be determined according to the speech feature vector of the training speech: embodiments of the present invention can determine an input speech feature vector from the speech feature vector of the training speech, use the input speech feature vector as the input of the neural network model and the I-vector as its first target output, and adjust the weights of the neural network model.

Optionally, once the input and the first target output of the neural network model are fixed, there are many ways to adjust the weights so that the output of the neural network model tends to the first target output; for example, the error backpropagation algorithm can be used. With the input and the first target output fixed, embodiments of the present invention place no restriction on the specific weight adjustment means.

Optionally, the input speech feature vector (the input of the neural network model) can be obtained from the per-frame speech feature vectors of the training speech. In one optional implementation, embodiments of the present invention can splice the speech feature vectors of a set number of adjacent frames of the training speech to obtain the input speech feature vector; for example, the MFCC features of 9 adjacent frames of the training speech (the number is only illustrative) can be spliced to obtain the input speech feature vector fed to the neural network model. Obviously, this way of determining the input speech feature vector is only optional; embodiments of the present invention can also select the speech feature vectors of multiple frames from the per-frame speech feature vectors of the training speech and splice them into the input speech feature vector.
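As an illustrative sketch of the frame splicing just described (the 9-frame window is the example value from the text; edge padding by repetition is an assumption of the sketch):

```python
import numpy as np

def splice_frames(feats, context=4):
    """Splice each frame with its `context` left and right neighbors
    (2 * context + 1 = 9 frames in total), padding the edges by repetition."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])  # (T, 9*D)
```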
Optionally, before adjusting the weights of the neural network model, embodiments of the present invention can also initialize the neural network model, for example initializing a neural network model such as a DNN model with a layer-wise initialization method, and then adjust the weights on the basis of the layer-wise initialized neural network model.
Step S130: obtain the speech feature vector of target detection speech, and determine the output of the first neural network model for the speech feature vector of the target detection speech.

After the first neural network model is trained, embodiments of the present invention can obtain target detection speech, extract its speech feature vector (e.g., MFCC features), use that speech feature vector as the input of the first neural network model, and determine the corresponding output of the first neural network model (i.e., obtain the output of the first neural network model for the speech feature vector of the target detection speech).
Step S140: determine an identity-factor latent variable according to the output.

Optionally, after obtaining the output of the first neural network model for the speech feature vector of the target detection speech, embodiments of the present invention can determine the mean of the output and, using that mean, determine the identity-factor (I-vector) latent variable in the process of training an SNR (signal-to-noise ratio)-invariant PLDA (probabilistic linear discriminant analysis) model.

It should be noted that the latent variable (hidden variable) is a technical term from the factor analysis theory of mathematics.
Step S150: estimate the posterior mean of the identity-factor latent variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain the speech identity feature extractor.

After the identity-factor latent variable (i.e., the latent variable of the I-vector) is obtained, that latent variable contains more compact speaker information and has higher reliability. Embodiments of the present invention can therefore use the posterior mean of the identity-factor latent variable as the second target output for training the first neural network model, adjusting the weights of the first neural network model so that its output tends to the second target output; when training is complete, the speech identity feature extractor is obtained.

It should be noted that the posterior mean is a technical term from probability theory.

Optionally, the input used when adjusting the weights of the first neural network model in step S150 can be determined according to the speech feature vector of the target detection speech, for example by splicing the speech feature vectors of a set number of adjacent frames of the target detection speech (this is only an optional example). The target detection speech can be speech uttered by a target speaker (the target speaker may be regarded as a legitimate speaker who needs to be registered).
Because embodiments of the present invention train the speech identity feature extractor with the identity-factor latent variable, which contains more compact speaker information and has high reliability, as the target, the speech identity features extracted by this extractor have higher reliability, realizing the extraction of novel, highly reliable speech identity features. To distinguish it from the existing I-vector, the speech identity feature extractor trained by embodiments of the present invention can be called an F-vector extractor, and the speech identity feature extracted by it can be called an F-vector.
The speech identity feature extractor training method provided by embodiments of the present invention thus includes: extracting the speech feature vector of training speech; determining the corresponding I-vector according to that speech feature vector; using the I-vector as the first target output of a neural network model and adjusting its weights to obtain a first neural network model; obtaining the speech feature vector of target detection speech and determining the output of the first neural network model for it; determining the identity-factor latent variable according to that output; and estimating the posterior mean of the identity-factor latent variable, using it as the second target output, and adjusting the weights to obtain the speech identity feature extractor, realizing the training of a novel speech identity feature extractor.

The method is based on a neural network model and targets the posterior mean of the identity-factor latent variable, which contains more compact speaker information and has high reliability; the speech identity features extracted by the resulting extractor therefore have higher reliability, and the requirements on the speech can be reduced. The training method provided by embodiments of the present invention makes the extraction of novel, highly reliable speech identity features possible.
To better understand the shortcomings of the I-vector, and to illustrate the I-vector determination method provided by embodiments of the present invention, Fig. 3 shows the flow of a method for determining the I-vector corresponding to training speech. Referring to Fig. 3, the method may include:
Step S200: based on a GMM, determine sufficient statistics according to the per-frame speech feature vectors of the training speech.

The speech feature vector of the training speech consists of the speech feature vectors of its frames, which together form the speech feature vector sequence of the training speech.

Optionally, let the speech feature vector sequence of the i-th training speech be $\{x_1^i, x_2^i, \ldots, x_T^i\}$, where $x_t^i$ denotes the speech feature vector of the t-th frame of the i-th training speech.

Based on a GMM of order $K$, the sufficient statistics can then be determined according to the following formulas:

$N_k^i = \sum_{t=1}^{T} \gamma_t^i(k)$ denotes the zeroth-order sufficient statistic, where $\gamma_t^i(k)$ denotes the occupancy of the k-th mixture component by the t-th frame speech feature vector;

$F_k^i = \sum_{t=1}^{T} \gamma_t^i(k)\,(x_t^i - m_k)$ denotes the first-order sufficient statistic;

where the GMM of order $K$ can be denoted $\lambda = \{w_k, m_k, \Sigma_k\}_{k=1}^{K}$: lowercase $k$ indexes the mixture components, $w_k$ denotes the weight, $m_k$ the mean, and $\Sigma_k$ the covariance of the k-th component.
Step S210: determine the total variability space matrix according to the sufficient statistics.

After the sufficient statistics are determined, the total variability space matrix used in the I-vector extraction algorithm (denoted $T$) can be determined based on them. Optionally, the EM (expectation maximization) algorithm can be used to iteratively solve for the total variability space matrix from the sufficient statistics. The EM algorithm may be regarded as a method that solves a special class of maximum likelihood problems in an iterative manner.
Step S220: determine the I-vector corresponding to the training speech according to the total variability space matrix.

After the total variability space matrix is obtained, the I-vector corresponding to the training speech can be computed from the total variability space matrix, the block-diagonal matrix formed from the zeroth-order sufficient statistics, and the result of splicing the first-order sufficient statistics.

Optionally, the I-vector can be determined by the following formula:

$w = \bigl(I + T^{\top}\Sigma^{-1} N\, T\bigr)^{-1}\, T^{\top}\Sigma^{-1} F$

where $I$ denotes the identity matrix, $T$ denotes the total variability space matrix, $N$ denotes the block-diagonal matrix whose diagonal blocks are $N_k I$, $F$ denotes the vector spliced from the first-order statistics $F_k$, and $\Sigma$ denotes the diagonal matrix whose diagonal elements are composed of the elements of the diagonal covariance matrices of the GMM mixture components.

Optionally, after the I-vector is obtained, its posterior covariance can be expressed as $\bigl(I + T^{\top}\Sigma^{-1} N\, T\bigr)^{-1}$. It can be seen that the shorter the speech duration, the smaller the corresponding zeroth-order statistics and the larger the posterior covariance, so the estimated I-vector is less reliable. This demonstrates that the reliability of the I-vector places high demands on the duration of the speech; with short speech, the I-vector easily becomes unreliable.
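The statistics and posterior computation above can be sketched in numpy as follows; this is a minimal illustration assuming a pre-trained diagonal-covariance GMM and total variability matrix, not the patent's implementation:

```python
import numpy as np

def ivector_posterior(X, w, m, S_diag, T_mat):
    """Minimal sketch of the statistics and i-vector posterior above.
    X: (n_frames, D) feature sequence; w: (K,) GMM weights; m: (K, D) means;
    S_diag: (K, D) diagonal covariances; T_mat: (K*D, R) total variability
    space matrix. All model parameters are assumed pre-trained."""
    K, D = m.shape
    # occupancies gamma_t(k) of each mixture component for each frame
    log_p = np.stack(
        [-0.5 * (((X - m[k]) ** 2 / S_diag[k]).sum(axis=1)
                 + np.log(S_diag[k]).sum() + D * np.log(2 * np.pi))
         for k in range(K)], axis=1) + np.log(w)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    N = gamma.sum(axis=0)                                       # zeroth-order
    F = np.stack([gamma[:, k] @ (X - m[k]) for k in range(K)])  # first-order
    # L = I + T' Sigma^{-1} N T, with N block-diagonal (N_k * I blocks)
    sig_inv = 1.0 / S_diag.reshape(-1)
    n_rep = np.repeat(N, D)
    L = np.eye(T_mat.shape[1]) + T_mat.T @ (T_mat * (n_rep * sig_inv)[:, None])
    ivec = np.linalg.solve(L, T_mat.T @ (sig_inv * F.reshape(-1)))
    post_cov = np.linalg.inv(L)   # grows as N shrinks, i.e. for short speech
    return ivec, post_cov
```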
After obtaining the I-vector corresponding to the training speech, embodiments of the present invention can use it as the first target output to adjust the weights of a neural network model in a form such as a DNN, realizing the first training of the neural network model and obtaining the first neural network model; then, based on the first neural network model, the posterior mean of the identity-factor latent variable serves as the second target output for a further weight adjustment of the first neural network model, yielding the speech identity feature extractor.

Optionally, the neural network model used by embodiments of the present invention can be a DNN model, a CNN model, or the like; that is, the neural network model used to train the first neural network model can be a DNN model, a CNN model, etc., and correspondingly the first neural network model can also be a DNN model, a CNN model, etc.
It should be noted that the DNN model is a deep learning framework model whose structure mainly comprises an input layer, multiple hidden layers, and an output layer. Generally, the first layer of the DNN model is the input layer, the last layer is the output layer, the middle layers are the hidden layers, and the DNN model is fully connected between layers.

Optionally, taking the DNN model as an example, in the process of using the I-vector as the first target output, adjusting the weights (i.e., parameters) of the DNN model, and obtaining the first DNN model (a form of the first neural network model), embodiments of the present invention can use means such as the error backpropagation algorithm (other DNN weight adjustment means may of course be used) to adjust the weights of the DNN model so that the output of the adjusted DNN model tends to the first target output, yielding the first DNN model. The weights adjusted in this process mainly comprise the weights of the linear transformations connecting the layers of the DNN model (between the input layer and a hidden layer, between hidden layers, and between a hidden layer and the output layer).

Correspondingly, in the process of using the posterior mean of the identity-factor latent variable as the second target output, adjusting the weights of the first DNN model, and obtaining the speech identity feature extractor, embodiments of the present invention can likewise use means such as the error backpropagation algorithm to adjust the weights of the first DNN model so that its output tends to the second target output, yielding the speech identity feature extractor. The weights of the first DNN model adjusted in this process may also comprise the weights of the linear transformations connecting the layers of the DNN model.
Taking the CNN model as an example, the structure of a CNN model mainly comprises an input layer, convolutional layers, pooling layers, and a fully connected layer, where there can be multiple convolutional and pooling layers. Optionally, in the process of using the I-vector as the first target output, adjusting the weights (i.e., parameters) of the CNN model, and obtaining the first CNN model (a form of the first neural network model), embodiments of the present invention can use means such as the error backpropagation algorithm (other CNN weight adjustment means may of course be used) to adjust the weights of the CNN model so that the output of the adjusted CNN model tends to the first target output, yielding the first CNN model. The weights adjusted in this process may include elements of the model parameters of the CNN model such as the bias matrices of the convolutional layers and the weight matrix and bias vector of the fully connected layer.

Correspondingly, in the process of using the posterior mean of the identity-factor latent variable as the second target output, adjusting the weights of the first CNN model, and obtaining the speech identity feature extractor, embodiments of the present invention can likewise use means such as the error backpropagation algorithm to adjust the weights of the first CNN model so that its output tends to the second target output, yielding the speech identity feature extractor. The weights adjusted in this process may also comprise elements of the model parameters of the CNN model such as the initial bias matrices of the convolutional layers and the initial weight matrix and initial bias vector of the fully connected layer.

Obviously, the above neural network structures and weight adjustment means are only optional. Once the input and the target output of the neural network model are fixed, embodiments of the present invention can use any weight adjustment means that makes the output of the neural network model tend to the target output. The weight adjustment of the neural network model can be an iterative process: the weights of the neural network model are adjusted iteratively so that its output tends to the target output.
Optionally, in one optional implementation, embodiments of the present invention can first initialize the neural network model with a layer-wise initialization method to obtain a neural network structure as shown in Fig. 4, and train the first neural network model on this basis.

Taking a neural network model in DNN form as an example, Fig. 5 shows the flow of a method for training the first neural network model. Referring to Fig. 5, the method may include:
Step S300: initialize the DNN model with a layer-wise initialization method.

Step S310: splice the speech feature vectors of a set number of adjacent frames of the training speech to obtain the input speech feature vector.

Step S320: use the input speech feature vector as the input of the DNN model and the I-vector as the first target output of the DNN model, take the mean squared error between each output of the DNN model and the first target output as the loss function, and adjust the weights of the DNN model to obtain the first DNN model.

Optionally, as an example, as shown in Fig. 6, embodiments of the present invention can splice the speech feature vectors of 9 adjacent frames of the training speech as the input of the DNN model, take the mean squared error between each output of the DNN model and the first target output as the loss function, and iteratively adjust the weights of the DNN model until its output tends to the first target output and the training convergence condition is reached, obtaining the first DNN model.
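A minimal PyTorch sketch of this first training stage follows; the layer sizes, activation, optimizer, and learning rate are assumed, since the patent does not fix them (the 9 x 39 input matches the earlier feature sketches):

```python
import torch
import torch.nn as nn

# Assumed architecture: 9 spliced frames of 39-dim features in,
# a 400-dim i-vector as the first target output.
dnn = nn.Sequential(
    nn.Linear(9 * 39, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 400),
)

def train_stage(model, loader, epochs=10, lr=1e-3):
    """Adjust the weights so the output tends to the target; `loader` yields
    (spliced_frames, target) batches, the target here being the utterance's
    i-vector repeated for each frame window."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    mse = nn.MSELoss()                      # mean squared error loss function
    for _ in range(epochs):
        for x, target in loader:
            opt.zero_grad()
            loss = mse(model(x), target)    # output vs. first target output
            loss.backward()                 # error backpropagation
            opt.step()
```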
After the training of the first neural network model is completed, the identity-factor latent variable can be determined based on the target detection speech. Optionally, the output of the first neural network model for the speech feature vector of the target detection speech can be computed; for example, for the i-th utterance of the s-th speaker, the corresponding output of the first neural network model can be computed, and the mean of the output (denoted $V_{si}$) determined. Using these means as training data, an SNR-invariant PLDA model is trained, and the identity-factor latent variable can be computed in the training process.

Optionally, training the SNR-invariant PLDA model can be realized according to the following formula:

$V_{si} = m + R\,h_i + U\,g_b + \epsilon_{si}$

where $b$ denotes the SNR interval corresponding to the target detection speech, $m$ denotes the mean, $R$ denotes the speaker information space, $U$ denotes the SNR space, $h_i$ denotes the identity-factor latent variable, $g_b$ denotes the SNR-specific factor, and $\epsilon_{si}$ denotes the residual term.

In the process of training the SNR-invariant PLDA model, after the identity-factor latent variable is determined, its posterior mean can be estimated. The posterior mean contains more compact speaker information; using it as the target output, the first neural network model undergoes a further weight adjustment, and the F-vector extractor is trained (i.e., the first neural network model is trained with the posterior mean as the target output, and the model obtained at training convergence is the F-vector extractor).
Optionally, taking a neural network model in DNN form as an example, Fig. 7 shows the flow of a method for training the speech identity feature extractor (F-vector extractor) based on the first neural network model. Referring to Fig. 7, the method may include:

Step S400: determine the input of the first DNN model according to the speech feature vector of the target detection speech.

Optionally, the speech feature vectors of a set number of adjacent frames of the target detection speech can be spliced to obtain the input of the first DNN model.

Step S410: use the posterior mean of the identity-factor latent variable as the second target output of the first DNN model, take the mean squared error between each output of the first DNN model and the second target output as the loss function, and adjust the first DNN model to obtain the speech identity feature extractor.

Optionally, as an example, as shown in Fig. 8, embodiments of the present invention can splice the speech feature vectors of a set number of adjacent frames of the target detection speech as the input of the first DNN model, take the mean squared error between each output of the first DNN model and the second target output as the loss function, and iteratively adjust the weights of the first DNN model until its output tends to the second target output and the training convergence condition is reached, obtaining the speech identity feature extractor (F-vector extractor).
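Under the same assumptions as the previous sketch, the second training stage differs only in its target. A hypothetical `posterior_mean_loader`, yielding (spliced_frames, posterior_mean) pairs built from the target detection speech, is assumed here:

```python
# Stage two: fine-tune the first DNN (dnn and train_stage from the previous
# sketch) with the posterior mean of the identity-factor latent variable as
# the second target output.
train_stage(dnn, posterior_mean_loader, epochs=5)
# After convergence, `dnn` serves as the F-vector extractor.
```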
Optionally, on the basis of a DNN model initialized with the layer-wise initialization method, the training process of the F-vector extractor can be as shown in Fig. 9 for reference, where w1 denotes the first dimension of the I-vector and wn the n-th dimension.

The training method provided by embodiments of the present invention is based on a neural network model and targets the posterior mean of the identity-factor latent variable, which contains more compact speaker information and has high reliability. It trains a novel speech identity feature extractor, realizes the extraction of novel, highly reliable speech identity features, and provides a guarantee of higher accuracy for the subsequent speaker identity recognition performed on the basis of the speech identity feature.
On the basis of the speech identity feature extractor obtained by the above training, embodiments of the present invention can train, based on the speech identity feature extractor, a classifier for recognizing different speakers; the classifier can be trained based on the speech of a predetermined speaker (e.g., a speaker who needs to be registered).

Optionally, Fig. 10 shows the flow of a classifier training method provided by an embodiment of the present invention. Referring to Fig. 10, the method may include:
Step S500: obtain the target detection speech of a target speaker.

Embodiments of the present invention place low requirements on the target detection speech; its duration can be arbitrary. The target detection speech of the target speaker can be speech of a legitimate speaker who needs to be registered. Embodiments of the present invention can realize the training of the classifier for the target speaker based on the speaker verification scenario (the one-to-one identity confirmation problem); subsequently, the trained classifier recognizes the speech of the target speaker, realizing higher-precision speaker verification.
Step S510: extract the speech feature vector of the target detection speech.

Optionally, embodiments of the present invention can extract the MFCC features of the target detection speech.
Step S520: call the pre-trained speech identity feature extractor, and input the speech feature vector of the target detection speech into the speech identity feature extractor to obtain the corresponding speech identity feature.

As described above, the speech identity feature extractor is trained with the identity-factor latent variable as the target output. On the basis of the trained speech identity feature extractor (F-vector extractor), embodiments of the present invention can use the speech feature vector of the target detection speech as the input of the F-vector extractor, which correspondingly outputs the speech identity feature (F-vector).

For example, for the i-th utterance of speaker s, its MFCC features can be extracted and used as the input of the F-vector extractor to obtain the corresponding F-vector.
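A small usage sketch (assuming the fine-tuned `dnn` from the earlier sketches): the per-frame-window outputs of the extractor are averaged into one F-vector per utterance.

```python
import torch

def extract_fvector(extractor, spliced_frames):
    """Sketch: run the trained extractor over every frame window of one
    utterance and average the outputs into a single F-vector."""
    extractor.eval()
    with torch.no_grad():
        return extractor(spliced_frames).mean(dim=0)
```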
Step S530: train a classifier according to the speech identity feature.

After the speech identity features are obtained, their mean can be determined, and the classifier can be trained with the mean.
Optionally, the classifier trained by embodiments of the present invention can be used for the text-independent speaker verification scenario. As described above, voice-based speaker identity recognition can be divided into speaker identification and speaker verification; in terms of the requirements on the speech, it can further be divided into text-dependent and text-independent classes. Text-dependent means that the test speech uttered by the speaker must have the same semantic content as the registration speech, which applies to settings where the speaker cooperates; text-independent means that the semantic content of the speech is of no concern, so there are fewer limiting factors and the application is more flexible and widespread.

It should be noted that, because the semantic content of the speech is unrestricted, text-independent speaker identity recognition tends to suffer a mismatch between the training and test phases and usually requires a large amount of training speech to obtain good recognition performance. The classifier provided by embodiments of the present invention, however, is trained on the novel speech identity feature, which places lower requirements on the speech; the recognition accuracy of the classifier is therefore much less prone to decline as the speech duration shortens, making accurate speaker identity recognition possible.
Optionally, the classifier provided by embodiments of the present invention can be a PLDA (probabilistic linear discriminant analysis) classifier. One optional flow of training the classifier according to the speech identity feature can be as shown in Fig. 11, comprising:

Step S600: determine the mean of the speech identity features.

Assuming the speech identity feature has been extracted from the i-th utterance of speaker s, the mean $y_{si}$ of the speech identity feature can be determined.

Step S610: apply within-class covariance normalization and L2-norm normalization to the mean of the speech identity features to obtain processed features, and train the classifier with the processed features.

Optionally, after within-class covariance normalization and L2-norm normalization are applied to the mean $y_{si}$ of the speech identity features, the processed features can be used as training data to train the PLDA classifier.
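For illustration, within-class covariance normalization followed by L2-norm normalization might be sketched as follows (the Cholesky-based WCCN projection is a standard choice assumed by the sketch, not specified by the patent):

```python
import numpy as np

def wccn_l2(features, labels):
    """Sketch: within-class covariance normalization followed by L2-norm
    normalization of per-utterance feature means (rows of `features`)."""
    classes = np.unique(labels)
    dim = features.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:                         # average within-class covariance
        Xc = features[labels == c]
        Xc = Xc - Xc.mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    W /= len(classes)
    A = np.linalg.cholesky(np.linalg.inv(W))  # WCCN projection matrix
    Y = features @ A
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # L2 normalization
```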
Optionally, to effectively embed a nonparametric discriminant analysis algorithm into the PLDA classifier and improve its training precision, embodiments of the present invention can use a nonparametric PLDA model based on the following two covariance matrices (the within-class covariance matrix and the nonparametric between-class covariance matrix below):

(1) The within-class covariance matrix, which can be calculated as follows:

$\Phi_w = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{H_s}\sum_{i=1}^{H_s}\,(y_{si}-u_s)(y_{si}-u_s)^{\top}$

where uppercase $S$ denotes the number of speakers, lowercase $s$ denotes the s-th speaker, $H_s$ denotes the number of utterances of the s-th speaker, and $u_s$ is the mean of the s-th speaker.
(2) The nonparametric between-class covariance matrix, which can be calculated with the following formula:

$\Phi_b = \sum_{s=1}^{S}\sum_{\substack{k=1\\k\neq s}}^{S}\frac{1}{H_s}\sum_{i=1}^{H_s} g(s,k,i)\,\bigl(y_{si}-m_k(y_{si})\bigr)\bigl(y_{si}-m_k(y_{si})\bigr)^{\top}$

where $\psi_q^k(y_{si})$ denotes the q-th nearest neighbor of the feature $y_{si}$ among the features of speaker $k$, $Q$ is the total number of neighbor features, and $m_k(y_{si})=\frac{1}{Q}\sum_{q=1}^{Q}\psi_q^k(y_{si})$ represents the mean of the $Q$ neighbor features. $g(s,k,i)$ represents a weighting function, defined as follows:

$g(s,k,i)=\frac{\min\bigl\{d^{\alpha}\bigl(y_{si},\psi_Q^s(y_{si})\bigr),\, d^{\alpha}\bigl(y_{si},\psi_Q^k(y_{si})\bigr)\bigr\}}{d^{\alpha}\bigl(y_{si},\psi_Q^s(y_{si})\bigr)+d^{\alpha}\bigl(y_{si},\psi_Q^k(y_{si})\bigr)}$

where the exponent parameter $\alpha$ adjusts the weighting of the distance metric function $d(y_1,y_2)$, which refers to the Euclidean distance between features $y_1$ and $y_2$, and the value of the parameter $Q$ is generally set to the mean of the total utterance counts of the speakers. The weighting function $g(s,k,i)$ evaluates how close the projected feature $y_{si}$ lies to the local classification boundary between speakers, and thereby determines the contribution of the feature $y_{si}$ to the nonparametric between-class scatter matrix $\Phi_b$: if the feature $y_{si}$ is near the classification boundary, $g(s,k,i)$ attains its maximum value of 0.5; if the feature $y_{si}$ is far from the classification boundary, the value of $g(s,k,i)$ becomes correspondingly smaller.

The features in the above formulas refer to speech identity features.
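An illustrative numpy sketch of both covariance matrices follows; Q, α, and the handling of self-distances are assumed choices of the sketch:

```python
import numpy as np

def plda_covariances(feats, labels, Q=10, alpha=2):
    """Sketch of the two covariance matrices above over features y_si."""
    speakers = np.unique(labels)
    dim = feats.shape[1]

    phi_w = np.zeros((dim, dim))             # (1) within-class covariance
    for s in speakers:
        Ys = feats[labels == s]
        diff = Ys - Ys.mean(axis=0)
        phi_w += diff.T @ diff / len(Ys)
    phi_w /= len(speakers)

    def qth_nn_dist(y, Yc):                  # distance to the Q-th neighbor
        d = np.sort(np.linalg.norm(Yc - y, axis=1))
        d = d[d > 0]                         # drop the self-distance
        return d[min(Q, len(d)) - 1]

    phi_b = np.zeros((dim, dim))             # (2) nonparametric between-class
    for s in speakers:
        Ys = feats[labels == s]
        for k in speakers:
            if k == s:
                continue
            Yk = feats[labels == k]
            for y in Ys:
                idx = np.argsort(np.linalg.norm(Yk - y, axis=1))[:Q]
                m_k = Yk[idx].mean(axis=0)   # mean of Q nearest neighbors
                ds = qth_nn_dist(y, Ys) ** alpha
                dk = qth_nn_dist(y, Yk) ** alpha
                g = min(ds, dk) / (ds + dk)  # boundary weight, at most 0.5
                d = (y - m_k)[:, None]
                phi_b += g * (d @ d.T) / len(Ys)
    return phi_w, phi_b
```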
After the within-class covariance matrix and the nonparametric between-class covariance matrix are obtained, embodiments of the present invention can replace the within-class transformation matrix in the scoring function of the PLDA classifier with the within-class covariance matrix, and the between-class transformation matrix with the nonparametric between-class covariance matrix. Specifically, for a given registered speech identity feature $y_1$ and a test speech identity feature $y_2$, omitting the constant term, the score of the PLDA classifier can be calculated as:

$\mathrm{score}(y_1,y_2)=(y_1-\mu)^{\top}\Phi_w(y_1-\mu)+2\,(y_1-\mu)^{\top}\Phi_b(y_2-\mu)+(y_2-\mu)^{\top}\Phi_w(y_2-\mu)$

where $\mu$ is the global mean, i.e., the mean of the F-vector training set.
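A direct transcription of the scoring formula (a sketch; `mu`, `phi_w`, and `phi_b` come from the steps above):

```python
import numpy as np

def plda_score(y1, y2, mu, phi_w, phi_b):
    """Score the registered feature y1 against the test feature y2;
    mu is the mean of the F-vector training set."""
    a, b = y1 - mu, y2 - mu
    return a @ phi_w @ a + 2 * (a @ phi_b @ b) + b @ phi_w @ b
```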
After the classifier is trained, embodiments of the present invention can realize higher-precision speaker verification by using the classifier to distinguish the speech of the target speaker from the speech of other speakers. When recognizing a speaker, embodiments of the present invention can extract the speech feature vector of the current speaker, use the speech identity feature extractor provided by embodiments of the present invention to extract the corresponding speech identity feature based on that speech feature vector, input the speech identity feature into the classifier trained for the target speaker, and, from the output of the classifier, recognize whether the current speaker is the target speaker, realizing the recognition of the current speaker.

Optionally, the simplified flow of training the extractor and the classifier in embodiments of the present invention can be as shown in Fig. 12: the I-vector corresponding to the training speech serves as the supervision information for the DNN model, establishing a mapping from the speech feature vector to the I-vector feature space; the I-vector is extracted, and the DNN model is trained with the I-vector as the target. Subsequently, to obtain more compact speaker information, the identity-factor latent variable is determined in the SNR-invariant PLDA modeling process, and the DNN model is fine-tuned again with supervision information based on the identity-factor latent variable, yielding the final F-vector extractor. The F-vector extractor then extracts the F-vector of speech, and the PLDA classifier for speaker identity recognition is realized based on the F-vector.
The speech identity feature extractor training apparatus provided in the embodiment of the present invention is introduced below. The speech identity feature extractor training apparatus described below may be regarded as the program modules that an electronic device (in the form of, for example, a server or a terminal) needs to set up in order to implement the speech identity feature extractor training method provided in the embodiment of the present invention. The speech identity feature extractor training apparatus described below and the speech identity feature extractor training method described above may be referred to in correspondence with each other.
Figure 13 is a structural block diagram of the speech identity feature extractor training apparatus provided in an embodiment of the present invention. Referring to Figure 13, the apparatus may include:
A first speech feature vector extraction module 100, configured to extract the speech feature vector of a training speech;
An identity factor determining module 110, configured to determine, according to the speech feature vector of the training speech, the I-vector corresponding to the training speech;
A first training module 120, configured to use the I-vector as the first target output of a neural network model and adjust the weights of the neural network model to obtain a first neural network model;
A first result determining module 130, configured to obtain the speech feature vector of a target detection speech and determine the output result of the first neural network model for the speech feature vector of the target detection speech;
A hidden variable determining module 140, configured to determine the identity factor hidden variable according to the output result; and
A second training module 150, configured to estimate the posterior mean of the identity factor hidden variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain the speech identity feature extractor.
Optionally, the first training module 120, when using the I-vector as the first target output of the neural network model and adjusting the weights of the neural network model to obtain the first neural network model, is specifically configured to:
determine an input speech feature vector according to the speech feature vector of the training speech; and
use the input speech feature vector as the input of the neural network model and the I-vector as the first target output of the neural network model, take the mean square error between each output of the neural network model and the first target output as the loss function, and adjust the weights of the neural network model to obtain the first neural network model.
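A minimal PyTorch sketch of this first training stage, assuming a simple feed-forward topology and an Adam optimizer (both are assumptions; the patent fixes neither):

```python
import torch
import torch.nn as nn

class IvectorDNN(nn.Module):
    """Feed-forward DNN mapping spliced speech features to the I-vector space."""
    def __init__(self, in_dim, hid_dim, ivec_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, ivec_dim),   # output lies in the I-vector space
        )

    def forward(self, x):
        return self.net(x)

def train_stage1(model, loader, epochs=10, lr=1e-3):
    """Adjust the weights with MSE between each output and the I-vector target."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for spliced_feats, ivec_target in loader:   # hypothetical data loader
            opt.zero_grad()
            loss = mse(model(spliced_feats), ivec_target)
            loss.backward()                         # back-propagate the MSE loss
            opt.step()
    return model   # the resulting model is the "first neural network model"
```

The second training stage described later (fine-tuning toward the posterior mean of the identity factor hidden variable) can reuse exactly this loop, with the targets in the data loader replaced by the posterior means and, if the dimensions differ, the output layer re-sized.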
Optionally, the first training module 120, when determining the input speech feature vector according to the speech feature vector of the training speech, is specifically configured to:
splice the speech feature vectors of a set number of adjacent frames of the training speech to obtain the input speech feature vector.
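For example, with a symmetric context window (the window of 5 frames per side below is an assumed hyperparameter, not a value fixed by the patent), the splicing can be sketched as:

```python
import numpy as np

def splice_frames(feats, context=5):
    """Splice each frame with `context` adjacent frames on each side.

    feats: (T, D) per-frame speech feature vectors.
    Returns a (T, (2*context+1)*D) array of input speech feature vectors;
    edge frames are padded by repeating the first/last frame.
    """
    T, _ = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)], axis=0)
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])
```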
Optionally, Figure 14 shows another structural block diagram of the speech identity feature extractor training apparatus provided in an embodiment of the present invention. As shown in Figure 13 and Figure 14, the apparatus may further include:
A model initialization module 160, configured to initialize the neural network model with a layer-wise initialization method.
Optionally, the model initialization module 160 may initialize the neural network model with the layer-wise initialization method before the weights of the neural network model are adjusted; correspondingly, the first training module 120 may perform its function on the basis of the initialized neural network model.
Optionally, the hidden variable determining module 140, when determining the identity factor hidden variable according to the output result, is specifically configured to:
determine the mean of the output results, train a signal-to-noise-ratio-invariant (SNR-invariant) PLDA model with the mean, and calculate the identity factor hidden variable in the training process.
Optionally, the hidden variable determining module 140, when calculating the identity factor hidden variable in the training process, is specifically configured to:
calculate the identity factor hidden variable h_i according to the formula V_si = m + R·h_i + U·g_b + ε_si;
wherein V_si denotes the mean of the output results of the first neural network model for the speech feature vector of the i-th utterance of the s-th speaker, b denotes the signal-to-noise-ratio interval corresponding to the target detection speech, m denotes the mean, R denotes the speaker information space, U denotes the signal-to-noise-ratio space, g_b denotes the SNR-specific factor, and ε_si denotes the residual term.
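Under a linear-Gaussian reading of this formula, the posterior mean of h_i follows from standard factor-analysis algebra. The sketch below assumes a known residual covariance Σ and fixed U and g_b, which is one plausible simplification of the E-step, not the patented estimation procedure:

```python
import numpy as np

def posterior_mean_h(V, m, R, U, g, Sigma, snr_bins):
    """Sketch of the E-step posterior mean of the identity factor h_i.

    V:        (n, D) output means V_si for the n utterances of one speaker.
    m:        (D,) global mean.
    R:        (D, r) speaker information space (factor loading matrix).
    U:        (D, u) signal-to-noise-ratio space.
    g:        (B, u) one SNR-specific factor per SNR interval.
    Sigma:    (D, D) residual covariance of eps_si.
    snr_bins: (n,) SNR-interval index b of each utterance.

    Under V_si = m + R h_i + U g_b + eps_si with eps ~ N(0, Sigma),
    E[h_i] = (I + n R^T Sigma^-1 R)^-1 R^T Sigma^-1 sum_i (V_si - m - U g_b).
    """
    n = V.shape[0]
    Sinv = np.linalg.inv(Sigma)
    prec = np.eye(R.shape[1]) + n * R.T @ Sinv @ R      # posterior precision
    resid = sum(V[i] - m - U @ g[snr_bins[i]] for i in range(n))
    return np.linalg.solve(prec, R.T @ Sinv @ resid)    # posterior mean
```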
Optionally, the second training module 150, when using the posterior mean as the second target output of the first neural network model and adjusting the weights of the first neural network model to obtain the speech identity feature extractor, is specifically configured to:
splice the speech feature vectors of a set number of adjacent frames of the target detection speech as the input of the first neural network model, use the posterior mean of the identity factor hidden variable as the second target output of the first neural network model, take the mean square error between each output of the first neural network model and the second target output as the loss function, and adjust the first neural network model to obtain the speech identity feature extractor.
Optionally, the identity factor determining module 110, when determining, according to the speech feature vector of the training speech, the I-vector corresponding to the training speech, is specifically configured to:
determine sufficient statistics according to the speech feature vector of each frame of the training speech based on a GMM model;
determine the total variability space matrix according to the sufficient statistics; and determine, according to the total variability space matrix, the I-vector corresponding to the training speech.
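A condensed NumPy sketch of this I-vector pipeline for one utterance, assuming a diagonal-covariance UBM; the variable names and the closed-form posterior follow the standard total-variability formulation rather than anything specific to the patent:

```python
import numpy as np

def ivector(feats, ubm_means, ubm_covs, ubm_weights, T_matrix):
    """Sketch: sufficient statistics under a GMM (UBM), then the I-vector.

    feats:     (n_frames, D) speech feature vectors of one utterance.
    ubm_means: (C, D) GMM means;  ubm_covs: (C, D) diagonal variances;
    ubm_weights: (C,) mixture weights.
    T_matrix:  (C*D, r) total variability space matrix.
    """
    C, D = ubm_means.shape
    # Responsibilities: posterior of each mixture component per frame.
    log_post = np.stack([
        -0.5 * np.sum((feats - ubm_means[c]) ** 2 / ubm_covs[c]
                      + np.log(2 * np.pi * ubm_covs[c]), axis=1)
        + np.log(ubm_weights[c])
        for c in range(C)], axis=1)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth-order and centered first-order sufficient statistics.
    N = post.sum(axis=0)                            # (C,)
    F = post.T @ feats - N[:, None] * ubm_means     # (C, D)

    # I-vector posterior mean: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F.
    Sigma_inv = (1.0 / ubm_covs).reshape(-1)        # (C*D,)
    TN = T_matrix * (np.repeat(N, D) * Sigma_inv)[:, None]
    prec = np.eye(T_matrix.shape[1]) + T_matrix.T @ TN
    rhs = T_matrix.T @ (Sigma_inv * F.reshape(-1))
    return np.linalg.solve(prec, rhs)
```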
An embodiment of the present invention further provides an electronic device, and the speech identity feature extractor training apparatus described above may be loaded into the electronic device in the form of a program. Figure 15 shows the hardware structure of the electronic device. Referring to Figure 15, the electronic device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
In the embodiment of the present invention, the number of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
Optionally, the processor 1 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, for example, at least one magnetic disk memory.
The memory stores an executable program that may be called and executed by the processor, and the program may be used to:
extract the speech feature vector of a training speech;
determine, according to the speech feature vector of the training speech, the I-vector corresponding to the training speech;
use the I-vector as the first target output of a neural network model, and adjust the weights of the neural network model to obtain a first neural network model;
obtain the speech feature vector of a target detection speech, and determine the output result of the first neural network model for the speech feature vector of the target detection speech;
determine the identity factor hidden variable according to the output result; and
estimate the posterior mean of the identity factor hidden variable, use the posterior mean as the second target output of the first neural network model, and adjust the weights of the first neural network model to obtain the speech identity feature extractor.
Optionally, for refinements and extensions of the functions of the program, reference may be made to the corresponding descriptions above, for example, the description of the speech identity feature extractor training method.
An embodiment of the present invention further provides a classifier training apparatus. The classifier training apparatus described below may be regarded as the program modules that an electronic device (in the form of, for example, a server or a terminal) needs to set up in order to implement the classifier training method provided in the embodiment of the present invention. The classifier training apparatus described below and the classifier training method described above may be referred to in correspondence with each other.
Figure 16 is a structural block diagram of the classifier training apparatus provided in an embodiment of the present invention. Referring to Figure 16, the classifier training apparatus may include:
A target detection speech obtaining module 200, configured to obtain the target detection speech of a target speaker;
A second speech feature vector extraction module 210, configured to extract the speech feature vector of the target detection speech;
A speech identity feature extraction module 220, configured to call the pre-trained speech identity feature extractor, input the speech feature vector of the target detection speech into the speech identity feature extractor, and obtain the corresponding speech identity feature, wherein the speech identity feature extractor is trained with the identity factor hidden variable as the target output; and
A training module 230, configured to train the classifier according to the speech identity feature.
Optionally, the training module 230, when training the classifier according to the speech identity feature, is specifically configured to:
determine the mean of the speech identity features; perform within-class covariance normalization and L2-norm normalization on the mean of the speech identity features to obtain processed features; and train the classifier with the processed features.
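A minimal sketch of this preprocessing, assuming the classical within-class covariance normalization (WCCN) formulation, i.e. the Cholesky factor of the inverse within-class covariance, followed by length normalization; the per-speaker averaging and the small-count guard are assumptions:

```python
import numpy as np

def preprocess_fvectors(fvecs, labels):
    """Sketch: WCCN then L2-norm normalization of per-speaker mean F-vectors."""
    speakers = np.unique(labels)
    means = np.stack([fvecs[labels == s].mean(axis=0) for s in speakers])

    # Within-class covariance, averaged over speakers.
    W = np.zeros((fvecs.shape[1], fvecs.shape[1]))
    for s, m in zip(speakers, means):
        diff = fvecs[labels == s] - m
        W += diff.T @ diff / max(len(diff), 1)
    W /= len(speakers)

    # WCCN projection: B such that B^T W B = I (Cholesky of W^-1).
    B = np.linalg.cholesky(np.linalg.inv(W))
    projected = means @ B
    # L2-norm normalization (length normalization).
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)
```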
Optionally, in the embodiment of the present invention, the classifier may be based on a PLDA model; correspondingly, the classifier may be a PLDA classifier. To improve the precision of the classifier, the within-class transformation matrix in the scoring function of the PLDA classifier may be replaced by the within-class covariance matrix, and the between-class transformation matrix may be replaced by the nonparametric between-class covariance matrix.
Optionally, the classifier training apparatus may be loaded into an electronic device in the form of a program. For the structure of the electronic device, reference may be made to Figure 15, including: at least one memory and at least one processor; the memory stores an executable program, and the program may specifically be used to:
obtain the target detection speech of a target speaker;
extract the speech feature vector of the target detection speech;
call the pre-trained speech identity feature extractor, input the speech feature vector of the target detection speech into the speech identity feature extractor, and obtain the corresponding speech identity feature, wherein the speech identity feature extractor is trained with the identity factor hidden variable as the target output; and
train the classifier according to the speech identity feature.
The embodiments of the present invention can realize the training of a novel speech identity feature extractor, and the speech identity feature extractor obtained by training can extract novel speech identity features with high reliability; further, classifier training with higher precision can be realized based on the novel speech identity features, and the classifier obtained by training can improve the accuracy of speaker identity recognition.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another. The apparatuses disclosed in the embodiments correspond to the methods disclosed in the embodiments, so their descriptions are relatively brief, and for relevant details, reference may be made to the descriptions of the method parts.
Those skilled in the art may further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the compositions and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A speech identity feature extractor training method, characterized by comprising:
extracting a speech feature vector of a training speech;
determining, according to the speech feature vector of the training speech, an identity factor I-vector corresponding to the training speech;
using the I-vector as a first target output of a neural network model, and adjusting weights of the neural network model to obtain a first neural network model;
obtaining a speech feature vector of a target detection speech, and determining an output result of the first neural network model for the speech feature vector of the target detection speech;
determining an identity factor hidden variable according to the output result; and
estimating a posterior mean of the identity factor hidden variable, using the posterior mean as a second target output of the first neural network model, and adjusting weights of the first neural network model to obtain a speech identity feature extractor.
2. The speech identity feature extractor training method according to claim 1, characterized in that the using the I-vector as the first target output of the neural network model and adjusting the weights of the neural network model to obtain the first neural network model comprises:
determining an input speech feature vector according to the speech feature vector of the training speech; and
using the input speech feature vector as an input of the neural network model and the I-vector as the first target output of the neural network model, taking a mean square error between each output of the neural network model and the first target output as a loss function, and adjusting the weights of the neural network model to obtain the first neural network model.
3. The speech identity feature extractor training method according to claim 2, characterized in that the determining the input speech feature vector according to the speech feature vector of the training speech comprises:
splicing the speech feature vectors of a set number of adjacent frames of the training speech to obtain the input speech feature vector.
4. The speech identity feature extractor training method according to any one of claims 1 to 3, characterized in that before the adjusting of the weights of the neural network model, the method further comprises:
initializing the neural network model with a layer-wise initialization method.
5. The speech identity feature extractor training method according to claim 1, characterized in that the determining the identity factor hidden variable according to the output result comprises:
determining a mean of the output results, training a signal-to-noise-ratio-invariant probabilistic linear discriminant analysis (SNR-invariant PLDA) model with the mean, and calculating the identity factor hidden variable in the training process.
6. The speech identity feature extractor training method according to claim 5, characterized in that the calculating the identity factor hidden variable in the training process comprises:
calculating the identity factor hidden variable h_i according to the formula V_si = m + R·h_i + U·g_b + ε_si;
wherein V_si denotes the mean of the output results of the first neural network model for the speech feature vector of the i-th utterance of the s-th speaker, b denotes the signal-to-noise-ratio interval corresponding to the target detection speech, m denotes the mean, R denotes the speaker information space, U denotes the signal-to-noise-ratio space, g_b denotes the SNR-specific factor, and ε_si denotes the residual term.
7. The speech identity feature extractor training method according to claim 1, characterized in that the using the posterior mean as the second target output of the first neural network model and adjusting the weights of the first neural network model to obtain the speech identity feature extractor comprises:
splicing the speech feature vectors of a set number of adjacent frames of the target detection speech as an input of the first neural network model, using the posterior mean of the identity factor hidden variable as the second target output of the first neural network model, taking a mean square error between each output of the first neural network model and the second target output as a loss function, and adjusting the first neural network model to obtain the speech identity feature extractor.
8. The speech identity feature extractor training method according to claim 1, characterized in that the determining, according to the speech feature vector of the training speech, the I-vector corresponding to the training speech comprises:
determining sufficient statistics according to the speech feature vector of each frame of the training speech based on a Gaussian mixture model (GMM);
determining a total variability space matrix according to the sufficient statistics; and
determining, according to the total variability space matrix, the I-vector corresponding to the training speech.
9. A classifier training method, characterized by comprising:
obtaining a target detection speech of a target speaker;
extracting a speech feature vector of the target detection speech;
calling a pre-trained speech identity feature extractor, and inputting the speech feature vector of the target detection speech into the speech identity feature extractor to obtain a corresponding speech identity feature, wherein the speech identity feature extractor is trained with an identity factor hidden variable as a target output; and
training a classifier according to the speech identity feature.
10. The classifier training method according to claim 9, characterized in that the training the classifier according to the speech identity feature comprises:
determining a mean of the speech identity features; and
performing within-class covariance normalization and L2-norm normalization on the mean of the speech identity features to obtain processed features, and training the classifier with the processed features.
11. The classifier training method according to claim 9 or 10, characterized in that the classifier is based on a probabilistic linear discriminant analysis (PLDA) model, and the classifier is a PLDA classifier; and in a scoring function of the PLDA classifier, a within-class transformation matrix is replaced by a within-class covariance matrix, and a between-class transformation matrix is replaced by a nonparametric between-class covariance matrix.
12. A speech identity feature extractor training apparatus, characterized by comprising:
a first speech feature vector extraction module, configured to extract a speech feature vector of a training speech;
an identity factor determining module, configured to determine, according to the speech feature vector of the training speech, an identity factor I-vector corresponding to the training speech;
a first training module, configured to use the I-vector as a first target output of a neural network model and adjust weights of the neural network model to obtain a first neural network model;
a first result determining module, configured to obtain a speech feature vector of a target detection speech and determine an output result of the first neural network model for the speech feature vector of the target detection speech;
a hidden variable determining module, configured to determine an identity factor hidden variable according to the output result; and
a second training module, configured to estimate a posterior mean of the identity factor hidden variable, use the posterior mean as a second target output of the first neural network model, and adjust weights of the first neural network model to obtain a speech identity feature extractor.
13. An electronic device, characterized by comprising: at least one memory and at least one processor, wherein the memory stores an executable program, and the program is used to:
extract a speech feature vector of a training speech;
determine, according to the speech feature vector of the training speech, an identity factor I-vector corresponding to the training speech;
use the I-vector as a first target output of a neural network model, and adjust weights of the neural network model to obtain a first neural network model;
obtain a speech feature vector of a target detection speech, and determine an output result of the first neural network model for the speech feature vector of the target detection speech;
determine an identity factor hidden variable according to the output result; and
estimate a posterior mean of the identity factor hidden variable, use the posterior mean as a second target output of the first neural network model, and adjust weights of the first neural network model to obtain a speech identity feature extractor.
14. A classifier training apparatus, characterized by comprising:
a target detection speech obtaining module, configured to obtain a target detection speech of a target speaker;
a second speech feature vector extraction module, configured to extract a speech feature vector of the target detection speech;
a speech identity feature extraction module, configured to call a pre-trained speech identity feature extractor and input the speech feature vector of the target detection speech into the speech identity feature extractor to obtain a corresponding speech identity feature, wherein the speech identity feature extractor is trained with an identity factor hidden variable as a target output; and
a training module, configured to train a classifier according to the speech identity feature.
15. An electronic device, characterized by comprising: at least one memory and at least one processor, wherein the memory stores an executable program, and the program is used to:
obtain a target detection speech of a target speaker;
extract a speech feature vector of the target detection speech;
call a pre-trained speech identity feature extractor, and input the speech feature vector of the target detection speech into the speech identity feature extractor to obtain a corresponding speech identity feature, wherein the speech identity feature extractor is trained with an identity factor hidden variable as a target output; and
train a classifier according to the speech identity feature.
CN201710910880.XA 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment Active CN109584884B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201710910880.XA CN109584884B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment
CN201910741216.6A CN110310647B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment
PCT/CN2018/107385 WO2019062721A1 (en) 2017-09-29 2018-09-25 Training method for voice identity feature extractor and classifier and related devices
US16/654,383 US11335352B2 (en) 2017-09-29 2019-10-16 Voice identity feature extractor and classifier training
US17/720,876 US20220238117A1 (en) 2017-09-29 2022-04-14 Voice identity feature extractor and classifier training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710910880.XA CN109584884B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910741216.6A Division CN110310647B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment

Publications (2)

Publication Number Publication Date
CN109584884A true CN109584884A (en) 2019-04-05
CN109584884B CN109584884B (en) 2022-09-13

Family

ID=65900669

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710910880.XA Active CN109584884B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment
CN201910741216.6A Active CN110310647B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910741216.6A Active CN110310647B (en) 2017-09-29 2017-09-29 Voice identity feature extractor, classifier training method and related equipment

Country Status (3)

Country Link
US (2) US11335352B2 (en)
CN (2) CN109584884B (en)
WO (1) WO2019062721A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106169295B (en) * 2016-07-15 2019-03-01 腾讯科技(深圳)有限公司 Identity vector generation method and device
CN107945806B (en) * 2017-11-10 2022-03-08 北京小米移动软件有限公司 User identification method and device based on sound characteristics
US20190244062A1 (en) * 2018-02-04 2019-08-08 KaiKuTek Inc. Gesture recognition method, gesture recognition system, and performing device therefore
DE112018006885B4 (en) * 2018-02-20 2021-11-04 Mitsubishi Electric Corporation TRAINING DEVICE, LANGUAGE ACTIVITY DETECTOR AND METHOD FOR DETECTING LANGUAGE ACTIVITY
CN111583907B (en) * 2020-04-15 2023-08-15 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111524525B (en) * 2020-04-28 2023-06-16 平安科技(深圳)有限公司 Voiceprint recognition method, device, equipment and storage medium of original voice
CN112001215B (en) * 2020-05-25 2023-11-24 天津大学 Text irrelevant speaker identity recognition method based on three-dimensional lip movement
CN112259078A (en) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 Method and device for training audio recognition model and recognizing abnormal audio
CN112164404A (en) * 2020-10-28 2021-01-01 广西电网有限责任公司贺州供电局 Remote identity authentication method and system based on voiceprint recognition technology
CN112466298B (en) * 2020-11-24 2023-08-11 杭州网易智企科技有限公司 Voice detection method, device, electronic equipment and storage medium

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI223791B (en) * 2003-04-14 2004-11-11 Ind Tech Res Inst Method and system for utterance verification
JP2008191444A (en) 2007-02-06 2008-08-21 Nec Electronics Corp Display driver ic
CN101241699B (en) * 2008-03-14 2012-07-18 北京交通大学 A speaker identification method for remote Chinese teaching
CN102820033B (en) * 2012-08-17 2013-12-04 南京大学 Voiceprint identification method
US9406298B2 (en) * 2013-02-07 2016-08-02 Nuance Communications, Inc. Method and apparatus for efficient i-vector extraction
US10438581B2 (en) * 2013-07-31 2019-10-08 Google Llc Speech recognition using neural networks
CN103391201B (en) * 2013-08-05 2016-07-13 公安部第三研究所 The system and method for smart card identity checking is realized based on Application on Voiceprint Recognition
CN104765996B (en) * 2014-01-06 2018-04-27 讯飞智元信息科技有限公司 Voiceprint password authentication method and system
CN105261367B (en) * 2014-07-14 2019-03-15 中国科学院声学研究所 A kind of method for distinguishing speek person
US9373330B2 (en) * 2014-08-07 2016-06-21 Nuance Communications, Inc. Fast speaker recognition scoring using I-vector posteriors and probabilistic linear discriminant analysis
CN105096121B (en) * 2015-06-25 2017-07-25 百度在线网络技术(北京)有限公司 voiceprint authentication method and device
CN105139856B (en) * 2015-09-02 2019-07-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Probability linear discriminant method for distinguishing speek person based on the regular covariance of priori knowledge
CN105895078A (en) * 2015-11-26 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method used for dynamically selecting speech model and device
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN105845140A (en) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 Speaker confirmation method and speaker confirmation device used in short voice condition
CN106098068B (en) * 2016-06-12 2019-07-16 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN106169295B (en) * 2016-07-15 2019-03-01 腾讯科技(深圳)有限公司 Identity vector generation method and device
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN106847292B (en) * 2017-02-16 2018-06-19 平安科技(深圳)有限公司 Method for recognizing sound-groove and device
CN107039036B (en) * 2017-02-17 2020-06-16 南京邮电大学 High-quality speaker recognition method based on automatic coding depth confidence network
CN107146601B (en) * 2017-04-07 2020-07-24 南京邮电大学 Rear-end i-vector enhancement method for speaker recognition system
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
WO2019053898A1 (en) * 2017-09-15 2019-03-21 Nec Corporation Pattern recognition apparatus, pattern recognition method, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160293167A1 (en) * 2013-10-10 2016-10-06 Google Inc. Speaker recognition using neural networks
US20150149165A1 (en) * 2013-11-27 2015-05-28 International Business Machines Corporation Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920435A (en) * 2019-04-09 2019-06-21 厦门快商通信息咨询有限公司 A kind of method for recognizing sound-groove and voice print identification device
CN110807333A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Semantic processing method and device of semantic understanding model and storage medium
CN110807333B (en) * 2019-10-30 2024-02-06 腾讯科技(深圳)有限公司 Semantic processing method, device and storage medium of semantic understanding model
CN113362829A (en) * 2021-06-04 2021-09-07 思必驰科技股份有限公司 Speaker verification method, electronic device and storage medium

Also Published As

Publication number Publication date
CN110310647A (en) 2019-10-08
US11335352B2 (en) 2022-05-17
CN110310647B (en) 2022-02-25
US20220238117A1 (en) 2022-07-28
WO2019062721A1 (en) 2019-04-04
CN109584884B (en) 2022-09-13
US20200043504A1 (en) 2020-02-06

Similar Documents

Publication Publication Date Title
CN109584884A (en) A kind of speech identity feature extractor, classifier training method and relevant device
CN110189769B (en) Abnormal sound detection method based on combination of multiple convolutional neural network models
CN104835498B (en) Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter
CN112259106B (en) Voiceprint recognition method and device, storage medium and computer equipment
CN104900235B (en) Method for recognizing sound-groove based on pitch period composite character parameter
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
US6253179B1 (en) Method and apparatus for multi-environment speaker verification
CN105096955B (en) A kind of speaker's method for quickly identifying and system based on model growth cluster
CN102324232A (en) Method for recognizing sound-groove and system based on gauss hybrid models
EP0822539A2 (en) Two-staged cohort selection for speaker verification system
CN102486922B (en) Speaker recognition method, device and system
CN108109613A (en) For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment
CN108986824A (en) A kind of voice playback detection method
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN109192224A (en) A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
CN108091326A (en) A kind of method for recognizing sound-groove and system based on linear regression
CN109448732A (en) A kind of digit string processing method and processing device
CN110111798A (en) A kind of method and terminal identifying speaker
CN107545898B (en) Processing method and device for distinguishing speaker voice
Fasounaki et al. CNN-based Text-independent automatic speaker identification using short utterances
CN100570712C (en) Based on anchor model space projection ordinal number quick method for identifying speaker relatively
Weng et al. The sysu system for the interspeech 2015 automatic speaker verification spoofing and countermeasures challenge
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant