CN110047504A - Speaker recognition method under identity vector x-vector linear transformation - Google Patents
Speaker recognition method under identity vector x-vector linear transformation
- Publication number
- CN110047504A (application CN201910312097.2A)
- Authority
- CN
- China
- Prior art keywords
- vector
- identity
- speaker
- linear transformation
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Abstract
The present invention relates to a speaker recognition method under a linear transformation of the identity vector x-vector. Its main steps are: perform feature extraction on the speech and extract its identity vectors x-vector and i-vector respectively; train a parallel factor analyzer using the x-vector and i-vector of the same speaker; select the parameters corresponding to the x-vector in the parallel factor analyzer and, on the basis of these parameters, apply a linear transformation to the identity vector x-vector to obtain the xl-vector; train a PLDA model on the new identity vector xl-vector; at test time, perform feature extraction and x-vector extraction on the speech to be tested, feed the x-vector into the linear transformer obtained in the training stage to obtain the new identity vector xl-vector, and finally feed it into the PLDA model obtained in the training stage to obtain the final result. The present invention improves speaker recognition performance while keeping memory requirements and computation speed similar to the baseline system.
Description
Technical field
The present invention relates to speaker recognition technology in biometric identification, and more specifically to a speaker recognition technique under a linear transformation of the identity vector x-vector.
Background art
Speech is the most direct and convenient way for humans to communicate; its convenience, economy, accuracy, and other advantages have attracted the attention of many research institutions. Research on speech signal processing is of great significance for advancing human-computer interaction and artificial intelligence. To this end, the related fields of speech signal processing, such as speech recognition, speech coding, speech synthesis, and speaker recognition, have received increasing attention and theoretical study. Speaker recognition, also known as voiceprint recognition, aims to perform identity authentication based on the unique pronunciation of each speaker. Each speaker's voice has its own unique personal character, because each speaker's innate vocal organs differ, and the voice is further shaped by the acquired environment in which the speaker grows up. It is precisely this difference that makes it possible to use voice, as a biometric characteristic, as a recognition target, and speaker recognition has gradually formed a fairly complete recognition framework of its own.
A speaker recognition system comprises preprocessing, feature extraction, and model training and matching components. The key technologies of speaker recognition include the feature parameter extraction algorithm, the choice of model, and the model matching algorithm, which directly determine the performance of the recognition system. Speaker models are divided into generative models and discriminative models. A generative model learns the characteristics of each class, i.e., one model per class; recognition data are mapped into each model to decide which class they belong to. A discriminative model learns a classification boundary that can be used to decide which class different data belong to. Among these, the identity vector i-vector based on total variability modeling (TVM) and the identity vector x-vector based on the time-delay deep neural network (TDNN) are representative, and are the two most widely used vector models at present.
The back ends of both x-vector and i-vector systems generally use probabilistic linear discriminant analysis (PLDA) as the scoring method. Under the x-vector model, results on long-duration speech are comparable to those of the i-vector, and results on short-duration speech are better. Various papers have studied how to improve system performance under the x-vector model; research shows that model combination of the i-vector and x-vector, or fusion of their PLDA scores, can improve system performance. However, such methods involve two systems, requiring a large amount of memory, and computation speed is also affected. Other research improves the robustness of the x-vector through data augmentation, but that approach is affected by the recognition environment.
Summary of the invention
The object of the present invention is to provide a speaker recognition method that takes into account the memory footprint and computation time of recognizing a target speaker online.
To achieve the above object, the technical solution of the present invention is to provide a speaker recognition method under a linear transformation of the identity vector x-vector, characterized by comprising the following steps:
Step 1: extract mel-frequency cepstral coefficients from the speaker's training speech as the speaker's features;
Step 2: using the features obtained in step 1, train an x-vector model with a deep neural network structure, establishing the identity vector x-vector model and obtaining the identity vector x-vector;
Step 3: using the features obtained in step 1, train an i-vector model based on the EM algorithm, establishing the identity vector i-vector model and obtaining the identity vector i-vector;
Step 4: assume that the i-vector and x-vector of the same speaker are projections of the same vector; train the parameters of a parallel factor analyzer based on the EM algorithm, completing the training of the parallel factor analyzer;
Step 5: in the linear transformer, retain the parameters corresponding to the x-vector from the parameters of the parallel factor analyzer; on the basis of the linear transformer, express the identity vector xl-vector as a linear transformation of the x-vector, establishing the identity vector xl-vector model and obtaining the identity vector xl-vector;
Step 6: using the identity vector xl-vector, update the parameter model of PLDA with the EM algorithm, completing the training of the PLDA model;
Step 7: speaker recognition in the test stage. Perform feature extraction on the enrollment speech and the corresponding speech to be identified, pass them through the identity vector x-vector model to obtain the identity vector x-vector, feed the identity vector x-vector into the trained linear transformer to obtain the new identity vector xl-vector, and finally input the identity vector xl-vector into the trained PLDA model to obtain the speaker recognition result.
Preferably, in step 4, considering that different identity vectors may map to the same vector space, the method of parallel factor analysis is used to obtain this common vector.
Preferably, in step 4, the identity vectors i-vector of the l-th speaker are denoted φ_i(l,1), ..., φ_i(l,k), and the identity vectors x-vector are denoted φ_x(l,1), ..., φ_x(l,k), where k is the number of input utterances of that speaker, φ_i(l,k) is the identity vector i-vector of the k-th utterance of the l-th speaker, and φ_x(l,k) is the identity vector x-vector of the k-th utterance of the l-th speaker. The identity vector i-vector and identity vector x-vector of the same speaker can be regarded as projections of the same vector, and can therefore be expressed as

φ_i(l,k) = μ_i + F_i h(l) + ε_i(l,k)
φ_x(l,k) = μ_x + F_x h(l) + ε_x(l,k)

where μ_i is the mean vector of the identity vector i-vector; μ_x is the mean vector of the identity vector x-vector; F_i is the projection matrix corresponding to the i-vector; F_x is the projection matrix corresponding to the x-vector; h(l) is the hidden variable of the l-th speaker; ε_i(l,k) is the residual vector of the identity vector i-vector of the k-th utterance of the l-th speaker after the linear transformation, with ε_i ~ N(0, Σ_i), where Σ_i is the covariance matrix of the i-vector and N(0, Σ_i) denotes a normal distribution with mean 0 and covariance Σ_i; ε_x(l,k) is the residual vector of the identity vector x-vector of the k-th utterance of the l-th speaker after the linear transformation, with ε_x ~ N(0, Σ_x), where Σ_x is the covariance matrix of the residual ε_x and N(0, Σ_x) denotes a normal distribution with mean 0 and covariance Σ_x. The parameters of the parallel factor analyzer, θ = {μ_i, F_i, Σ_i, μ_x, F_x, Σ_x}, are obtained by the EM algorithm.
Preferably, in step 6, based on the parameters θ_x = {μ_x, F_x, Σ_x} corresponding to the x-vector, the identity vector xl-vector after the linear transformation is expressed as

φ_xl = L_x⁻¹ F_xᵀ Σ_x⁻¹ (φ_x − μ_x)

where L_x⁻¹ denotes the posterior covariance of the xl-vector, with L_x = I + F_xᵀ Σ_x⁻¹ F_x. This can further be written in the form φ_xl = A φ_x − b, where A and b are linear parameters, so that the identity vector xl-vector is expressed as a linear transformation of the x-vector.
The present invention considers that the information of the i-vector generative model is helpful to the x-vector model system. In the training stage, the i-vector is introduced to obtain a linear transformation matrix suitable for the x-vector, and a speaker recognition method under a linear transformation of the x-vector is proposed.
In step 4 of the present invention, the parallel factor analyzer is trained using the x-vector and i-vector, so that the analyzer contains both the information of the x-vector and the information of the i-vector. The linear transformer of the x-vector obtained on the basis of this analyzer therefore better retains the information of the i-vector, so that the new identity vector xl-vector carries i-vector information, ultimately improving the recognition performance of the system.
After the training stage of steps 1-6 is completed, the test stage of step 7 of the present invention does not need to extract the i-vector identity vector again; meanwhile, after the parallel factor analyzer is obtained in the training stage, only the linear transformer of the x-vector needs to be retained. The memory requirement of the test stage therefore does not increase, and the linear transformation has little effect on actual computation.
The method of the present invention performs speaker recognition using the identity vector obtained after a linear transformation of the x-vector. By making reasonable use of i-vector information, the effect of improving recognition performance is achieved. Specifically, in the training stage, a parallel factor analyzer is trained using the x-vector and i-vector of the same speaker; the parameters corresponding to the x-vector in the parallel factor analyzer are selected, and on the basis of these parameters a linear transformation is applied to the identity vector x-vector to obtain the xl-vector. In the test stage, feature extraction and x-vector extraction are performed on the speech to be tested; the x-vector is fed into the linear transformer obtained in the training stage to obtain the new identity vector xl-vector, which is finally fed into the PLDA model obtained in the training stage to obtain the final result.
This produces the following beneficial effects:
(1) The parallel factor analyzer is trained using the x-vector and i-vector, so that the analyzer contains both the information of the x-vector and the information of the i-vector. The linear transformer of the x-vector obtained on the basis of this analyzer therefore better retains the information of the i-vector, so that the new identity vector xl-vector carries i-vector information, ultimately improving the recognition performance of the system;
(2) The test stage does not need to extract the i-vector identity vector again; meanwhile, after the parallel factor analyzer is obtained in the training stage, only the linear transformer of the x-vector needs to be retained. The memory requirement of the test stage therefore does not increase, and the linear transformation has little effect on actual computation.
Description of the drawings
Fig. 1 is the flow chart of speaker recognition under the identity vector x-vector linear transformation implemented by the present invention;
Fig. 2 shows the parameter settings of the frame-level layers in the x-vector neural network architecture.
Specific embodiment
The present invention will be further explained below with reference to specific embodiments. It should be understood that these embodiments are merely illustrative of the present invention and do not limit its scope. In addition, it should be understood that, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications to the present invention, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
The speaker recognition method under a linear transformation of the identity vector x-vector disclosed by the embodiments of the present invention, as shown in Fig. 1, comprises the following steps:
Step 1, feature extraction. The present invention uses mel-frequency cepstral coefficients (MFCC) as the speaker features. The mel-frequency scale generally has a logarithmic relationship with the actual frequency: Mel(f) = 2595 lg(1 + f/700), where Mel(f) is the mel frequency and f is the ordinary frequency. The MFCC features are obtained as follows: (1) preprocessing, including pre-emphasis, framing and windowing, and endpoint detection; let the speech signal x(m) after preprocessing be x_i(m), where i is the frame index; (2) fast Fourier transform, X(i,k) = FFT[x_i(m)], where X(i,k) is the spectral signal; (3) spectral line energy, E(i,k) = |X(i,k)|²; (4) mel filter energy, S(i,m) = Σ_k E(i,k) H_m(k), where H_m(k) is the mel filter function and M is the number of filters; (5) DCT transform of the logarithm.
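As an illustration of substeps (1)-(3) above, the mel mapping and spectral line energy can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the patent's implementation: the function names, frame length, and hop size are illustrative, and pre-emphasis, windowing, endpoint detection, the mel filterbank, and the DCT are omitted.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale from the text: Mel(f) = 2595 * lg(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (windowing omitted)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def line_energy(frames):
    """Per-bin spectral line energy E(i, k) = |X(i, k)|^2 via the FFT of each frame."""
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec) ** 2
```

The remaining substeps would apply a triangular mel filterbank H_m(k) to `line_energy`, take the logarithm, and apply a DCT.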
Step 2, identity vector x-vector model establishment. The x-vector model training is based on a deep neural network structure. The first five layers are frame-level; the overall input of the TDNN is a segment of speech, and each TDNN layer takes a fixed number of frames. The network parameters of the first five layers are shown in Fig. 2. A pooling layer then accumulates the output vectors of the TDNN layers and computes their mean and standard deviation as the pooling layer's output. The pooling layer is followed by two fully connected layers plus a final softmax output layer. The number of output neurons is consistent with the number of speakers in the training set, and the output of the neural network is a posterior probability. After the neural network is trained over multiple iterations, the output of the sixth layer is used as the x-vector.
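The statistics-pooling step described above (mean and standard deviation of the frame-level TDNN outputs, turning a variable-length sequence into a fixed-length vector) can be sketched as follows; `stats_pooling` is an illustrative name, and the frame-level outputs are assumed to be given as a NumPy array.

```python
import numpy as np

def stats_pooling(frame_outputs):
    """Statistics pooling: concatenate the per-dimension mean and standard
    deviation of the (T, D) frame-level outputs into one 2*D segment vector."""
    mu = frame_outputs.mean(axis=0)
    sigma = frame_outputs.std(axis=0)
    return np.concatenate([mu, sigma])
```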
Step 3, identity vector i-vector model establishment. Given the speech sequence O = {o_1, o_2, ..., o_T} of a speaker s, the c-th Gaussian component at time t in the speech sequence can be expressed as o_{c,t} = μ_c + T_c x + ε, where o_{c,t} denotes the speech sequence of the c-th Gaussian component at time t, μ_c is the mean of the c-th Gaussian component, T_c is the projection matrix of the c-th Gaussian component, x is the hidden variable of the speaker, and ε is the residual. The EM algorithm is selected to train the i-vector model. In the E step (computing the expectation), the first-order statistic F_c and the second-order statistic S_c are defined respectively as F_c = Σ_t γ_c(t)(o_{c,t} − μ_c) and S_c = Σ_t γ_c(t)(o_{c,t} − μ_c)(o_{c,t} − μ_c)ᵀ, where γ_c(t) is the occupancy of the c-th Gaussian component for the t-th frame of speech. The posterior mean of x is expressed as φ = L⁻¹ Tᵀ Σ⁻¹ F, where L⁻¹ is the posterior covariance of the identity vector i-vector, L = I + Σ_c N_c T_cᵀ Σ_c⁻¹ T_c, N_c is the zeroth-order statistic of the c-th Gaussian component, I is the identity matrix, T is the matrix composed of all the component matrices T_c, F is the first-order statistic, and Σ is the covariance matrix of the residual ε. The main purpose of the M step (maximization) is to optimize the matrices T and Σ; the optimal solutions of these two matrices are obtained by taking derivatives, where F(s) denotes the first-order statistic of the s-th speech segment, x(s) the hidden variable of the s-th segment, and N(s) the zeroth-order statistic of the s-th segment. The i-vector model is established by iteratively updating the E and M steps in turn.
Step 4, training of the parallel factor analyzer. The identity vectors i-vector of the l-th speaker are denoted φ_i(l,1), ..., φ_i(l,k), and the identity vectors x-vector are denoted φ_x(l,1), ..., φ_x(l,k), where k is the number of input utterances of that speaker, φ_i(l,k) is the identity vector i-vector of the k-th utterance of the l-th speaker, and φ_x(l,k) is the identity vector x-vector of the k-th utterance of the l-th speaker. The identity vector i-vector and identity vector x-vector of the same speaker can be regarded as projections of the same vector, and can therefore be expressed as

φ_i(l,k) = μ_i + F_i h(l) + ε_i(l,k)
φ_x(l,k) = μ_x + F_x h(l) + ε_x(l,k)

where μ_i is the mean vector of the identity vector i-vector; μ_x is the mean vector of the identity vector x-vector; F_i is the projection matrix corresponding to the i-vector; F_x is the projection matrix corresponding to the x-vector; h(l) is the hidden variable of the l-th speaker; ε_i(l,k) is the residual vector of the identity vector i-vector of the k-th utterance of the l-th speaker after the linear transformation, with ε_i ~ N(0, Σ_i), where Σ_i is the covariance matrix of the i-vector and N(0, Σ_i) denotes a normal distribution with mean 0 and covariance Σ_i; ε_x(l,k) is the residual vector of the identity vector x-vector of the k-th utterance of the l-th speaker after the linear transformation, with ε_x ~ N(0, Σ_x), where Σ_x is the covariance matrix of the residual ε_x and N(0, Σ_x) denotes a normal distribution with mean 0 and covariance Σ_x. The parameters of the parallel factor analyzer, θ = {μ_i, F_i, Σ_i, μ_x, F_x, Σ_x}, are obtained by the EM algorithm.
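The generative model of the parallel factor analyzer can be sketched by sampling an (i-vector, x-vector) pair from a shared hidden variable h(l). The toy dimensions and randomly initialized parameters below are illustrative, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
d_i, d_x, r = 5, 4, 2          # toy dimensions (the embodiment uses 600, 512)
theta = {
    "mu_i": rng.normal(size=d_i), "F_i": rng.normal(size=(d_i, r)),
    "Sigma_i": np.diag(rng.uniform(0.1, 1.0, d_i)),
    "mu_x": rng.normal(size=d_x), "F_x": rng.normal(size=(d_x, r)),
    "Sigma_x": np.diag(rng.uniform(0.1, 1.0, d_x)),
}

def sample_pair(theta, rng):
    """Draw one (phi_i, phi_x) pair from the shared hidden variable h(l):
       phi_i = mu_i + F_i h + eps_i,  phi_x = mu_x + F_x h + eps_x."""
    h = rng.normal(size=theta["F_i"].shape[1])
    phi_i = (theta["mu_i"] + theta["F_i"] @ h
             + rng.multivariate_normal(np.zeros(d_i), theta["Sigma_i"]))
    phi_x = (theta["mu_x"] + theta["F_x"] @ h
             + rng.multivariate_normal(np.zeros(d_x), theta["Sigma_x"]))
    return phi_i, phi_x
```

EM training of θ then alternates between inferring h(l) from each speaker's stacked pairs and re-estimating the means, factor matrices, and covariances.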
Step 5, linear transformer. The parallel factor analyzer obtained in the training stage contains the parameters of both the i-vector and the x-vector; in actual online operation, only the parameters θ_x = {μ_x, F_x, Σ_x} corresponding to the x-vector are needed. On the basis of these parameters, the identity vector xl-vector model after the linear transformation is obtained.
Step 6, identity vector xl-vector model establishment. Based on the parameters θ_x = {μ_x, F_x, Σ_x} corresponding to the x-vector, the identity vector xl-vector after the linear transformation is expressed as

φ_xl = L_x⁻¹ F_xᵀ Σ_x⁻¹ (φ_x − μ_x)

where L_x⁻¹ denotes the posterior covariance of the xl-vector, with L_x = I + F_xᵀ Σ_x⁻¹ F_x. This is further written in the form φ_xl = A φ_x − b, where A and b are linear parameters, so that the identity vector xl-vector is expressed as a linear transformation of the x-vector.
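Under this factor-analysis reading of step 6, the linear transformer (A, b) follows directly from θ_x. The sketch below assumes that reading; the function name is illustrative.

```python
import numpy as np

def make_linear_transformer(mu_x, F_x, Sigma_x):
    """Build (A, b) such that phi_xl = A @ phi_x - b is the posterior mean of
    the shared hidden variable given only the x-vector:
       L_x = I + F_x^T Sigma_x^{-1} F_x
       phi_xl = L_x^{-1} F_x^T Sigma_x^{-1} (phi_x - mu_x)
    so A = L_x^{-1} F_x^T Sigma_x^{-1} and b = A @ mu_x."""
    Sinv = np.linalg.inv(Sigma_x)
    L = np.eye(F_x.shape[1]) + F_x.T @ Sinv @ F_x
    A = np.linalg.solve(L, F_x.T @ Sinv)
    return A, A @ mu_x
```

Only A and b need to be retained at test time, which is why the memory footprint does not grow over the x-vector baseline.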
Step 7, PLDA model training. Assume the training data consist of the speech of I speakers, each of whom has J different speech segments of their own. Define the j-th speech segment of the i-th speaker as x_ij. Then, according to the factor analysis definition, the generative model of x_ij is x_ij = μ + F h_i + G w_ij + ε_ij, where μ is the mean vector, F is the speaker information matrix, h_i is the hidden variable of the i-th speaker, G is the channel information matrix, w_ij is the channel hidden variable of the j-th speech segment of the i-th speaker, and ε_ij is the residual of the j-th speech segment of the i-th speaker. The parameter model of PLDA is updated using the EM algorithm.
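The PLDA generative model of step 7 can be sketched by sampling from it: the speaker factor h_i is shared within a speaker, while the channel factor w_ij and the residual vary per utterance. The function name and parameters below are illustrative, not trained values.

```python
import numpy as np

def sample_plda(mu, F, G, sigma2, n_speakers, n_utts, rng):
    """Draw utterance vectors x_ij = mu + F h_i + G w_ij + eps_ij, with an
    isotropic residual of variance sigma2. Returns (n_speakers, n_utts, d)."""
    d = mu.shape[0]
    X = np.empty((n_speakers, n_utts, d))
    for i in range(n_speakers):
        h = rng.normal(size=F.shape[1])            # shared within speaker i
        for j in range(n_utts):
            w = rng.normal(size=G.shape[1])        # per-utterance channel factor
            X[i, j] = mu + F @ h + G @ w + rng.normal(scale=np.sqrt(sigma2), size=d)
    return X
```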
Step 8, speaker recognition in the test stage. Perform feature extraction and x-vector extraction on the enrollment speech and the corresponding speech to be identified; feed the x-vectors into the linear transformer obtained in the training stage to obtain the new identity vectors xl-vector, which are finally fed into the PLDA model obtained in the training stage to obtain the final result.
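The test-stage flow above can be sketched as a small composition. Cosine similarity is used here only as a stand-in for the trained PLDA scorer, and all names are illustrative.

```python
import numpy as np

def recognize(phi_x_test, phi_x_enroll, A, b, score_fn):
    """Test-stage flow: apply the trained linear transformer
    phi_xl = A @ phi_x - b to both enrollment and test x-vectors, then score."""
    e = A @ phi_x_enroll - b
    t = A @ phi_x_test - b
    return score_fn(e, t)

def cosine_score(a, c):
    # stand-in for the trained PLDA log-likelihood-ratio scorer
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))
```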
The method of the present invention is simulated and analyzed below.
Simulation verification of speaker recognition under the linearly transformed x-vector identity vector is carried out on the NIST SRE 2010 test set. The test set comprises the test tasks of 9 scenarios (common conditions, CC), including interview, microphone, and telephone-channel data, where the telephone channel further incorporates different speaker vocal-effort styles, mainly high vocal effort, normal vocal effort, and low vocal effort. The present invention uses the 5th scenario (CC'5), i.e., the scenario based on different telephone channels under normal vocal effort. The equal error rate (EER) and the detection cost function (DCF) are used as evaluation criteria to measure the performance of the speaker recognition system.
Testing is performed on the three task test sets coreext-coreext, core-10sec, and 10sec-10sec of NIST SRE 2010, where coreext and core refer to long-duration speech and 10sec refers to short-duration speech. The simulation uses the speech data of Switchboard 2, Switchboard Cellular, and NIST SRE 2004-2008 as training data. The experiments use the x-vector and i-vector systems as baseline models. The UBM is trained on male and female speech together; the x-vector model uses 20-dimensional MFCC features as acoustic features, while the i-vector model uses the same 20-dimensional MFCC static feature parameters together with their first- and second-order differences, i.e., 60-dimensional features. For each speech segment, a 600-dimensional i-vector and a 512-dimensional x-vector are obtained. In the baseline systems the identity vector is reduced to 400 dimensions by LDA, and then a full-covariance PLDA model is trained with a speaker-space rank of 200 dimensions and a channel-space rank of 0 dimensions. The xl-vector proposed by the present invention already takes into account maximizing the between-class distance and minimizing the within-class distance of the speaker variables in its design, and therefore does not use the LDA step.
Table 1 compares the different systems under the EER and DCF evaluation criteria on the three tasks coreext-coreext, core-10sec, and 10sec-10sec; the dimension of the xl-vector is 512. Here i-vector and x-vector are the two baseline systems, and the fusion system adds the scores of the PLDA models of the i-vector and x-vector. On all three tasks, the proposed xl-vector outperforms the two baseline systems under the EER criterion; under the DCF criterion it is slightly worse than the x-vector system on the 10sec-10sec task, and better than the two baseline systems on the other two tasks. Compared with the fusion system, the xl-vector system has a clear EER advantage on the coreext-coreext task; the memory and computation speed required by the xl-vector are similar to those of the x-vector, whereas the fusion system must consider both the x-vector and the i-vector, and therefore needs more memory and its computation slows down at run time. In summary, the xl-vector of the present invention has a clear advantage over both the baseline systems and the fusion system.
Table 1
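The fused baseline above adds the scores of the two PLDA models. A minimal sketch of such score-level fusion is shown below; the z-normalisation of each system's scores before adding is an assumption, as the text only states that the scores are added.

```python
import numpy as np

def fuse_scores(scores_i, scores_x):
    # Score-level fusion of the two baseline systems: z-normalise each
    # system's PLDA trial scores, then add them. The normalisation step
    # is an assumption; the source only states that scores are added.
    z_i = (scores_i - scores_i.mean()) / scores_i.std()
    z_x = (scores_x - scores_x.mean()) / scores_x.std()
    return z_i + z_x

# Toy trial scores from the two systems (target, nontarget, target, nontarget):
si = np.array([2.1, -0.5, 1.7, -1.2])
sx = np.array([15.0, -3.0, 12.0, -8.0])
print(fuse_scores(si, sx))
```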
Table 2 compares the proposed identity vector xl-vector at different dimensions on the coreext-coreext, core-10sec and 10sec-10sec tasks under the EER and DCF evaluation criteria. On the coreext-coreext task, EER performance improves as the dimension increases, reaching its optimum at dimension 500 and remaining essentially optimal at dimension 512, while DCF performance stays essentially unchanged. On the core-10sec and 10sec-10sec tasks, EER performance degrades as the dimension increases and is optimal at dimension 200, while DCF varies by less than 10%. In summary, higher dimensions perform better when the test utterances are long, and lower dimensions perform better when the test utterances are short.
Table 2
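The EER criterion used in Tables 1 and 2 is the operating point at which the miss rate and the false-alarm rate are equal. A minimal sketch of computing it from target and nontarget trial scores (a standard threshold sweep, not the evaluation tooling used in the experiments):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    # Equal error rate: sweep every score as a candidate threshold and
    # find the point where miss rate and false-alarm rate cross.
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Miss rate: fraction of targets at or below the threshold.
    fnr = np.cumsum(labels) / labels.sum()
    # False-alarm rate: fraction of nontargets above the threshold.
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2.0

# Well-separated toy scores give an EER of zero:
print(compute_eer(np.array([2.0, 3.0, 4.0]), np.array([-2.0, -3.0, -4.0])))
```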
It can be seen that the xl-vector model proposed by the inventors uses a parallel factor analyser over x-vector and i-vector at the training stage to obtain a linear transformation of the x-vector, improving the performance of the speaker recognition system while leaving its memory requirements and computation speed unaffected.
Claims (4)
1. A speaker recognition method under an identity vector x-vector linear transformation, characterized by comprising the following steps:
Step 1: extract Mel-frequency cepstrum coefficients from the speaker training speech as the speaker features;
Step 2: using the features obtained in step 1, train an x-vector model with a deep neural network structure, thereby establishing the identity vector x-vector model and obtaining the identity vector x-vector;
Step 3: using the features obtained in step 1, train an i-vector model based on the EM algorithm, thereby establishing the identity vector i-vector model and obtaining the identity vector i-vector;
Step 4: assuming that the i-vector and the x-vector of the same speaker project onto the same vector, train the parameters of a parallel factor analyser based on the EM algorithm, thereby completing the training of the parallel factor analyser;
Step 5: retain the parameters of the parallel factor analyser corresponding to the x-vector as a linear transformer, and on the basis of this linear transformer express the identity vector xl-vector as a linear transformation of the x-vector, thereby establishing the identity vector xl-vector model and obtaining the identity vector xl-vector;
Step 6: using the identity vector xl-vector, update the parameters of the PLDA model with the EM algorithm, completing the training of the PLDA model;
Step 7: speaker recognition at the test stage:
perform feature extraction on the enrolment speech and the speech to be recognized, obtain the identity vector x-vector through the identity vector x-vector model, input the identity vector x-vector into the trained linear transformer to obtain the new identity vector xl-vector, and finally input the identity vector xl-vector into the trained PLDA model to obtain the speaker recognition result.
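The test-stage flow of step 7 can be sketched as below. The affine form of the linear transformer (phi_xl = A·phi_x − b) follows claim 4; cosine similarity stands in here for the trained PLDA scoring, which is an assumption made only to keep the sketch self-contained.

```python
import numpy as np

def transform_xvector(phi_x, A, b):
    # Apply the trained linear transformer: phi_xl = A @ phi_x - b.
    return A @ phi_x - b

def verify(enroll_xvec, test_xvec, A, b, score_fn):
    # Step 7 sketch: transform both identity vectors with the trained
    # linear transformer, then score the pair. score_fn stands in for
    # the trained PLDA model (an assumption for this sketch).
    e = transform_xvector(enroll_xvec, A, b)
    t = transform_xvector(test_xvec, A, b)
    return score_fn(e, t)

# Toy usage with cosine similarity in place of PLDA scoring:
cos = lambda a, c: float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))
A, b = np.eye(512), np.zeros(512)          # identity transform (toy values)
x1, x2 = np.random.randn(512), np.random.randn(512)
print(verify(x1, x2, A, b, cos))
```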
2. The speaker recognition method under the identity vector x-vector linear transformation according to claim 1, characterized in that: in step 4, considering that different identity vectors may map to the same vector space, this common vector is obtained by the method of parallel factor analysis.
3. The speaker recognition method under the identity vector x-vector linear transformation according to claim 1, characterized in that: in step 4, the identity vectors i-vector of the l-th speaker are expressed as φi(l,1),…,φi(l,k), and the identity vectors x-vector as φx(l,1),…,φx(l,k), where k denotes the number of input utterances of that speaker, φi(l,k) denotes the identity vector i-vector of the k-th utterance of the l-th speaker, and φx(l,k) denotes the identity vector x-vector of the k-th utterance of the l-th speaker; the identity vector i-vector and identity vector x-vector of the same speaker can be projected onto the same vector, and can therefore be expressed as
φi(l,k) = μi + Fi·h(l) + εi(l,k)
φx(l,k) = μx + Fx·h(l) + εx(l,k)
where μi denotes the mean vector of the identity vector i-vector; μx denotes the mean vector of the identity vector x-vector; Fi denotes the projection matrix corresponding to the i-vector; Fx denotes the projection matrix corresponding to the x-vector; h(l) denotes the hidden variable of the l-th speaker; εi(l,k) denotes the residual vector after the linear transformation of the identity vector i-vector of the k-th utterance of the l-th speaker, with εi ~ N(0, Σi), where Σi denotes the covariance matrix of the i-vector and N(0, Σi) denotes a normal distribution with mean vector 0 and covariance Σi; εx(l,k) denotes the residual vector after the linear transformation of the identity vector x-vector of the k-th utterance of the l-th speaker, with εx ~ N(0, Σx), where Σx denotes the covariance matrix of the residual εx and N(0, Σx) denotes a normal distribution with mean vector 0 and covariance Σx; the parameters θ = {μi, Fi, Σi, μx, Fx, Σx} of the parallel factor analyser are obtained by the EM algorithm.
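Under the two-view factor model of claim 3, the E-step posterior of the shared hidden variable h(l) has a closed form, sketched below under standard factor-analysis assumptions. The full EM loop (statistics accumulation over all utterances and M-step re-estimation of θ) is omitted.

```python
import numpy as np

def posterior_h(phi_i, phi_x, theta):
    # E-step of the parallel factor analyser: posterior mean of the shared
    # hidden variable h(l) given the i-vector and x-vector of one utterance,
    # under the model phi = mu + F h + eps of claim 3.
    mu_i, F_i, S_i, mu_x, F_x, S_x = theta
    Si_inv = np.linalg.inv(S_i)
    Sx_inv = np.linalg.inv(S_x)
    r = F_i.shape[1]
    # Posterior precision combines the prior N(0, I) with both observations.
    prec = np.eye(r) + F_i.T @ Si_inv @ F_i + F_x.T @ Sx_inv @ F_x
    rhs = F_i.T @ Si_inv @ (phi_i - mu_i) + F_x.T @ Sx_inv @ (phi_x - mu_x)
    return np.linalg.solve(prec, rhs)
```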
4. The speaker recognition method under the identity vector x-vector linear transformation according to claim 1, characterized in that: in step 6, based on the parameters θx = {μx, Fx, Σx} corresponding to the x-vector, the identity vector xl-vector after the linear transformation is expressed as
φxl = Λ·Fxᵀ·Σx⁻¹·(φx − μx), with Λ = (I + Fxᵀ·Σx⁻¹·Fx)⁻¹,
where Λ denotes the posterior covariance of the xl-vector; this is further written in the form φxl = A·φx − b, where A and b are linear parameters, so that the identity vector xl-vector is expressed as a linear transformation of the x-vector.
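The affine parameters A and b of claim 4 can be computed once from θx and then reused for every test utterance. A minimal sketch, assuming the posterior-covariance form Λ = (I + FxᵀΣx⁻¹Fx)⁻¹ reconstructed above:

```python
import numpy as np

def linear_transformer(mu_x, F_x, S_x):
    # Fold the x-vector branch of the parallel factor analyser into a
    # single affine map phi_xl = A @ phi_x - b (claim 4). Lam is the
    # posterior covariance (I + Fx' Sx^-1 Fx)^-1; this form reconstructs
    # the garbled equation in the source and should be read as a sketch.
    Sx_inv = np.linalg.inv(S_x)
    Lam = np.linalg.inv(np.eye(F_x.shape[1]) + F_x.T @ Sx_inv @ F_x)
    A = Lam @ F_x.T @ Sx_inv
    b = A @ mu_x
    return A, b
```

Because A and b depend only on the trained parameters, the test-stage cost per utterance is a single matrix-vector product, which is why the memory and speed of the xl-vector system match those of the x-vector baseline.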
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910312097.2A CN110047504B (en) | 2019-04-18 | 2019-04-18 | Speaker identification method under identity vector x-vector linear transformation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110047504A true CN110047504A (en) | 2019-07-23 |
CN110047504B CN110047504B (en) | 2021-08-20 |
Family
ID=67277768
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910312097.2A Active CN110047504B (en) | 2019-04-18 | 2019-04-18 | Speaker identification method under identity vector x-vector linear transformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110047504B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105139857A (en) * | 2015-09-02 | 2015-12-09 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Countercheck method for automatically identifying speaker aiming to voice deception |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
US9685159B2 (en) * | 2009-11-12 | 2017-06-20 | Agnitio Sl | Speaker recognition from telephone calls |
US9792823B2 (en) * | 2014-09-15 | 2017-10-17 | Raytheon Bbn Technologies Corp. | Multi-view learning in detection of psychological states |
CN107274905A (en) * | 2016-04-08 | 2017-10-20 | 腾讯科技(深圳)有限公司 | A kind of method for recognizing sound-groove and system |
CN108922556A (en) * | 2018-07-16 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | sound processing method, device and equipment |
CN109346084A (en) * | 2018-09-19 | 2019-02-15 | 湖北工业大学 | Speaker recognition method based on deep stacked autoencoder network |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN109801634A (en) * | 2019-01-31 | 2019-05-24 | 北京声智科技有限公司 | A kind of fusion method and device of vocal print feature |
Non-Patent Citations (6)
Title |
---|
DEHAK N, KENNY P J, DEHAK R, ET AL: "Front-end factor analysis for speaker verification", 《IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 * |
LONGTING XU,BO REN,GUANGLIN ZHANG,JICHEN YANG: "Linear transformation on x-vector for text-independent speaker verification", 《ELECTRONICS LETTERS》 * |
SAON G, SOLTAU H, NAHAMOO D, ET AL.: "Speaker adaptation of neural network acoustic models using i-vectors", 《2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING》 * |
XU L, DAS R K, YILMAZ E, ET AL: "Generative x-vectors for text-independent speaker verification", 《2018 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT)》 * |
XU L, LEE K A, LI H, ET AL: "Generalizing I-vector estimation for rapid speaker recognition", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 * |
XU LONGTING: "Research on speaker recognition technology based on sparse decomposition", 《China Doctoral Dissertations Full-text Database, Information Science and Technology》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111081256A (en) * | 2019-12-31 | 2020-04-28 | 苏州思必驰信息科技有限公司 | Digital string voiceprint password verification method and system |
CN111462759A (en) * | 2020-04-01 | 2020-07-28 | 科大讯飞股份有限公司 | Speaker labeling method, device, equipment and storage medium |
CN111462759B (en) * | 2020-04-01 | 2024-02-13 | 科大讯飞股份有限公司 | Speaker labeling method, device, equipment and storage medium |
WO2021174883A1 (en) * | 2020-09-22 | 2021-09-10 | 平安科技(深圳)有限公司 | Voiceprint identity-verification model training method, apparatus, medium, and electronic device |
CN113689861A (en) * | 2021-08-10 | 2021-11-23 | 上海淇玥信息技术有限公司 | Intelligent track splitting method, device and system for single sound track call recording |
CN113689861B (en) * | 2021-08-10 | 2024-02-27 | 上海淇玥信息技术有限公司 | Intelligent track dividing method, device and system for mono call recording |
CN114974259A (en) * | 2021-12-23 | 2022-08-30 | 号百信息服务有限公司 | Voiceprint recognition method |
CN115273863A (en) * | 2022-06-13 | 2022-11-01 | 广东职业技术学院 | Compound network class attendance system and method based on voice recognition and face recognition |
Also Published As
Publication number | Publication date |
---|---|
CN110047504B (en) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110047504A (en) | Speaker recognition method under identity vector x-vector linear transformation | |
Chauhan et al. | Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN109671442A (en) | Multi-to-multi voice conversion method based on STARGAN Yu x vector | |
CN109119072A (en) | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM | |
Kekre et al. | Speaker identification by using vector quantization | |
US20140236593A1 (en) | Speaker recognition method through emotional model synthesis based on neighbors preserving principle | |
Irum et al. | Speaker verification using deep neural networks: A | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
CN109599091A (en) | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector | |
CN109346084A (en) | Speaker recognition method based on deep stacked autoencoder network | |
CN107358947A (en) | Speaker recognition methods and system again | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN112735435A (en) | Voiceprint open set identification method with unknown class internal division capability | |
CN104464738B (en) | A kind of method for recognizing sound-groove towards Intelligent mobile equipment | |
Kheder et al. | A unified joint model to deal with nuisance variabilities in the i-vector space | |
Hong et al. | Combining deep embeddings of acoustic and articulatory features for speaker identification | |
Ng et al. | Teacher-student training for text-independent speaker recognition | |
Wang et al. | Robust speaker identification of iot based on stacked sparse denoising auto-encoders | |
Sekkate et al. | Speaker identification for OFDM-based aeronautical communication system | |
Lin et al. | Mixture representation learning for deep speaker embedding | |
CN117095669A (en) | Emotion voice synthesis method, system, equipment and medium based on variation automatic coding | |
Monteiro et al. | On the performance of time-pooling strategies for end-to-end spoken language identification | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||