CN110047504A - Speaker recognition method under identity vector x-vector linear transformation - Google Patents

Speaker recognition method under identity vector x-vector linear transformation

Info

Publication number
CN110047504A
Authority
CN
China
Prior art keywords
vector
identity
speaker
linear transformation
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910312097.2A
Other languages
Chinese (zh)
Other versions
CN110047504B (en)
Inventor
徐珑婷
张光林
赵萍
张磊
季云云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201910312097.2A priority Critical patent/CN110047504B/en
Publication of CN110047504A publication Critical patent/CN110047504A/en
Application granted granted Critical
Publication of CN110047504B publication Critical patent/CN110047504B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to a speaker recognition method under identity vector x-vector linear transformation. Its main steps are: perform feature extraction on the speech and extract its identity vectors x-vector and i-vector respectively; train a parallel factor analyzer using the x-vector and i-vector of the same speaker; select the parameters corresponding to x-vector in the parallel factor analyzer and, on the basis of these parameters, apply a linear transformation to the identity vector x-vector to obtain xl-vector; train a PLDA model on the new identity vector xl-vector. The speech to be tested undergoes feature extraction and x-vector extraction, is fed into the linear transform obtained in the training stage to obtain the new identity vector xl-vector, and is finally fed into the PLDA model obtained in the training stage to obtain the final result. The present invention improves the recognition performance of speaker recognition while ensuring that memory requirements and computing speed are similar to the baseline system.

Description

Speaker recognition method under identity vector x-vector linear transformation
Technical field
The present invention relates to speaker recognition technology in biometrics, and more specifically to a speaker recognition technology under identity vector x-vector linear transformation.
Background technique
Speech is the most direct and convenient way for humans to communicate; its particular advantages in convenience, economy, and accuracy have attracted the attention of many research institutions. Research on speech signal processing is of great significance for advancing human-computer interaction and the development of artificial intelligence. Accordingly, the related fields of speech signal processing, such as speech recognition, speech coding, speech synthesis, and speaker recognition, receive increasing attention and theoretical study. Speaker recognition, also known as voiceprint recognition, aims to perform identity authentication based on each speaker's unique way of speaking. Every speaker's voice has a unique personal character, because each speaker is born with different vocal organs and is further shaped by the acquired environment, yielding a voice entirely his or her own. It is precisely this difference that makes it possible to use voice, as a biometric trait, as a recognition target, and speaker recognition has gradually formed a fairly complete recognition system of its own.
A speaker recognition system consists of preprocessing, feature extraction, model training, and matching computation. The key technologies of speaker recognition include the feature parameter extraction algorithm, the choice of model, and the model matching algorithm, which directly determine the performance of the recognition system. Speaker models divide into generative models and discriminative models. A generative model learns the characteristics of each class, i.e., multiple models; recognition data are mapped into each model to determine which class the data belong to. A discriminative model learns a classification surface, which can be used to distinguish which class different data belong to. Among these, the identity vector i-vector based on total variability modeling (TVM) and the identity vector x-vector based on the time-delay deep neural network (TDNN) are representative, and are currently the two most widely used vector models.
The back ends of the x-vector and i-vector systems generally use the probabilistic linear discriminant analysis (PLDA) back-end scoring method. Under long-duration speech, the results under the x-vector model are comparable to those of i-vector; under short-duration speech, the x-vector results are better. Various papers have studied how to improve system performance under the x-vector model; research shows that model combination or PLDA score fusion of i-vector and x-vector can improve system performance, but such methods involve two systems, require a large amount of memory, and also affect computing speed. Other research improves the robustness of x-vector through data augmentation, but this method is affected by the recognition environment.
Summary of the invention
The object of the present invention is to provide a speaker recognition method that takes into account the memory footprint and computation time of recognizing a target speaker online.
In order to achieve the above object, the technical solution of the present invention is to provide a speaker recognition method under identity vector x-vector linear transformation, characterized by comprising the following steps:
Step 1, extract mel-frequency cepstral coefficients from the speaker's training speech as the speaker's features;
Step 2, using the features obtained in step 1, train an x-vector model with a deep neural network structure, establishing the identity vector x-vector model, so as to obtain the identity vector x-vector;
Step 3, using the features obtained in step 1, train an i-vector model based on the EM algorithm, establishing the identity vector i-vector model, so as to obtain the identity vector i-vector;
Step 4, regard the i-vector and x-vector of the same speaker as projections of the same vector, and train the parameters of a parallel factor analyzer based on the EM algorithm, thereby completing the training of the parallel factor analyzer;
Step 5, by means of a linear transform, retain the parameters corresponding to x-vector among the parameters of the parallel factor analyzer, and on the basis of this linear transform express the identity vector xl-vector as a linear transformation of x-vector, thereby establishing the identity vector xl-vector model and obtaining the identity vector xl-vector;
Step 6, using the identity vector xl-vector, update the parameter model of PLDA with the EM algorithm, completing the training of the PLDA model;
Step 7, speaker recognition in the test stage:
after feature extraction, the enrollment speech and the corresponding speech to be identified are passed through the identity vector x-vector model to obtain the identity vector x-vector; the identity vector x-vector is fed into the trained linear transform to obtain the new identity vector xl-vector; finally the identity vector xl-vector is input to the trained PLDA model to obtain the speaker recognition result.
Preferably, in step 4, considering that different identity vectors may map to the same vector space, the method of parallel factor analysis is used to obtain this common vector.
Preferably, in step 4, the identity vectors i-vector of the l-th speaker are expressed as φi(l,1), …, φi(l,k), and the identity vectors x-vector as φx(l,1), …, φx(l,k), where k denotes the number of input utterances of that speaker, φi(l,k) denotes the identity vector i-vector of the k-th utterance of the l-th speaker, and φx(l,k) denotes the identity vector x-vector of the k-th utterance of the l-th speaker. The identity vector i-vector and the identity vector x-vector of the same speaker can be regarded as projections of the same latent vector, and can therefore be expressed as
φi(l,k) = μi + Fi·h(l) + εi(l,k)
φx(l,k) = μx + Fx·h(l) + εx(l,k)
where μi denotes the mean vector of the identity vector i-vector; μx denotes the mean vector of the identity vector x-vector; Fi denotes the projection matrix corresponding to i-vector; Fx denotes the projection matrix corresponding to x-vector; h(l) denotes the hidden variable of the l-th speaker; εi(l,k) denotes the residual vector of the identity vector i-vector of the k-th utterance of the l-th speaker, with εi ~ N(0, Σi), where Σi denotes the covariance matrix of the i-vector residual and N(0, Σi) denotes a normal distribution with mean 0 and covariance Σi; εx(l,k) denotes the residual vector of the identity vector x-vector of the k-th utterance of the l-th speaker, with εx ~ N(0, Σx), where Σx denotes the covariance matrix of the residual εx and N(0, Σx) denotes a normal distribution with mean 0 and covariance Σx. Through the EM algorithm, the parameters θ = {μi, Fi, Σi, μx, Fx, Σx} of the parallel factor analyzer are obtained.
Preferably, in step 6, based on the parameters θx = {μx, Fx, Σx} corresponding to x-vector, the identity vector xl-vector after linear transformation is expressed as
φxl = Lx⁻¹·Fxᵀ·Σx⁻¹·(φx − μx)
where Lx⁻¹ denotes the posterior covariance of xl-vector and Lx = I + Fxᵀ·Σx⁻¹·Fx. This is further written in the form φxl = A·φx − b, where A = Lx⁻¹·Fxᵀ·Σx⁻¹ and b = A·μx are the linear parameters, so that the identity vector xl-vector is expressed as a linear transformation of x-vector.
The present invention, considering that the information of the i-vector generative model is helpful to the x-vector model system, introduces the i-vector in the training stage, obtains a linear transformation matrix suitable for x-vector, and proposes a speaker recognition method under x-vector linear transformation.
In step 4 of the present invention, the parallel factor analyzer is trained using both x-vector and i-vector, so that the analyzer contains the information of x-vector as well as the information of i-vector; the linear transform of x-vector obtained on the basis of this analyzer therefore well preserves the i-vector information, so that the new identity vector xl-vector carries i-vector information, ultimately improving the recognition performance of the system.
After the training stage of steps 1-6 is completed, the present invention does not need to perform i-vector identity vector extraction again in the test stage of step 7; meanwhile, once the parallel factor analyzer has been obtained in the training stage, only the linear transform of x-vector needs to be retained. The memory requirement of the test stage therefore does not increase, and the linear transformation has little effect on the actual computation.
The method of the present invention performs recognition in speaker recognition using the identity vector obtained after a linear transformation of x-vector. By making reasonable use of i-vector information, it achieves the effect of improving recognition performance. Specifically, in the training stage, a parallel factor analyzer is trained using the x-vector and i-vector of the same speaker; the parameters corresponding to x-vector in the parallel factor analyzer are selected, and on the basis of these parameters the identity vector x-vector is linearly transformed to obtain xl-vector. In the test stage, the speech to be tested undergoes feature extraction and x-vector extraction; the x-vector is fed into the linear transform obtained in the training stage to obtain the new identity vector xl-vector, which is finally fed into the PLDA model obtained in the training stage to obtain the final result.
This produces the following beneficial effects:
(1) The parallel factor analyzer is trained using both x-vector and i-vector, so the analyzer contains the information of x-vector as well as the information of i-vector; the linear transform of x-vector obtained on the basis of this analyzer well preserves the i-vector information, so that the new identity vector xl-vector carries i-vector information, ultimately improving the recognition performance of the system;
(2) The test stage does not need to perform i-vector identity vector extraction again; once the parallel factor analyzer has been obtained in the training stage, only the linear transform of x-vector needs to be retained, so the memory requirement of the test stage does not increase, and the linear transformation has little effect on the actual computation.
Detailed description of the invention
Fig. 1 is the speaker recognition flow chart under the identity vector x-vector linear transformation implemented by the present invention;
Fig. 2 shows the parameter settings of the frame-level layers in the x-vector neural network architecture.
Specific embodiment
The present invention will be further explained below with reference to specific embodiments. It should be understood that these embodiments are merely illustrative of the present invention and do not limit its scope. In addition, it should be understood that, after reading the content taught by the present invention, those skilled in the art can make various changes or modifications to the present invention, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
The speaker recognition method under identity vector x-vector linear transformation disclosed by the embodiments of the present invention, as shown in Fig. 1, comprises the following steps:
Step 1, feature extraction --- the present invention uses mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC) as the speaker features. The mel-frequency scale has an approximately logarithmic relation to actual frequency: Mel(f) = 2595·lg(1 + f/700), where Mel(f) denotes the mel frequency and f the ordinary frequency. The MFCC features are obtained as follows: (1) preprocessing, including pre-emphasis, framing and windowing, and endpoint detection; let the speech signal x(m) after preprocessing be xi(m), where i denotes the frame index; (2) fast Fourier transform, X(i,k) = FFT[xi(m)], where X(i,k) denotes the spectrum; (3) spectral-line energy, E(i,k) = |X(i,k)|²; (4) mel filterbank energy, S(i,m) = ∑k E(i,k)·Hm(k), where Hm(k) is the m-th mel filter function and M denotes the number of filters; (5) take the logarithm and apply the DCT.
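For illustration, a minimal NumPy/SciPy sketch of steps (2) to (5) above; the sampling rate, FFT size, filter count, and coefficient count are assumed values rather than ones prescribed by the method:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frames, sample_rate=8000, n_fft=512, n_filters=24, n_ceps=20):
    """Sketch of MFCC extraction; `frames` is assumed already pre-emphasized,
    framed, and windowed speech of shape (n_frames, frame_len)."""
    spectrum = np.fft.rfft(frames, n=n_fft)                    # (2) X(i,k) = FFT[xi(m)]
    energy = np.abs(spectrum) ** 2                             # (3) E(i,k) = |X(i,k)|^2
    # (4) triangular mel filterbank Hm(k); S(i,m) = sum_k E(i,k) * Hm(k)
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    hz_pts = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        H[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    fbank = energy @ H.T                                       # S(i,m)
    # (5) log, then DCT; keep the first n_ceps cepstral coefficients
    return dct(np.log(fbank + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```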
Step 2, identity vector x-vector model establishment --- x-vector model training is based on a deep neural network structure. The first five layers are frame-level: the overall input of the TDNN is a segment of speech, each TDNN layer covers a fixed number of frames, and the network parameters of the first five layers are set as shown in Fig. 2. A pooling layer then aggregates the output vectors of the TDNN layers, computing their mean and standard deviation as the pooling layer's output. The pooling layer is followed by two fully connected layers and finally a softmax output layer. The number of output neurons equals the number of speakers in the training set, and the output of the neural network is a posterior probability. Based on this neural network, after multiple training iterations, the output of the sixth layer is taken as the x-vector.
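A sketch of such a network in PyTorch; the layer widths, temporal contexts, and speaker count below are illustrative assumptions (Fig. 2 holds the actual frame-level settings):

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    """Sketch of the x-vector network of step 2: five frame-level TDNN layers
    (realized as dilated Conv1d), statistics pooling, two fully connected
    layers, and a softmax over training speakers."""
    def __init__(self, feat_dim=20, emb_dim=512, n_speakers=5000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.fc6 = nn.Linear(2 * 1500, emb_dim)    # layer 6: its output is the x-vector
        self.fc7 = nn.Linear(emb_dim, emb_dim)
        self.out = nn.Linear(emb_dim, n_speakers)  # posterior over training speakers

    def forward(self, x):          # x: (batch, feat_dim, n_frames)
        h = self.frame_layers(x)
        # statistics pooling: per-utterance mean and standard deviation
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.fc6(stats)
        logits = self.out(torch.relu(self.fc7(torch.relu(xvec))))
        return logits, xvec
```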
Step 3, identity vector i-vector model establishment --- given the speech sequence O = {o1, o2, …, oT} of a speaker s, the c-th Gaussian component of the speech sequence at time t can be expressed as oc,t = μc + Tc·x + ε, where oc,t denotes the observation of the c-th Gaussian component at time t, μc denotes the mean of the c-th Gaussian component, Tc denotes the projection matrix of the c-th Gaussian component, x denotes the hidden variable of the speaker, and ε denotes the residual. The EM algorithm is selected to train the i-vector model. In the E-step (computing the expectation), the first-order statistic Fc and the second-order statistic Sc are defined respectively as Fc = ∑t γc(t)·(oc,t − μc) and Sc = ∑t γc(t)·(oc,t − μc)·(oc,t − μc)ᵀ, where γc(t) denotes the occupancy of the c-th Gaussian component for the t-th frame. The posterior mean of x is expressed as φ = L⁻¹·Tᵀ·Σ⁻¹·F, where L⁻¹ denotes the posterior covariance of the identity vector i-vector, L = I + ∑c Nc·Tcᵀ·Σc⁻¹·Tc, Nc denotes the zero-order statistic of the c-th Gaussian component, I denotes the identity matrix, T denotes the matrix composed of all the component matrices Tc, F denotes the first-order statistics, and Σ denotes the covariance matrix of the residual ε. The main purpose of the M-step (maximization) is to optimize the matrix T and the matrix Σ: the optimal solutions of the two matrices are obtained by differentiating the auxiliary function built from the statistics ∑s F(s)·x(s)ᵀ and ∑s N(s), where F(s) denotes the first-order statistics of the s-th utterance, x(s) denotes the hidden variable of the s-th utterance, and N(s) denotes the zero-order statistics of the s-th utterance. The i-vector model is established by iteratively alternating the E-step and the M-step.
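For concreteness, a sketch of the per-utterance E-step above, with Baum-Welch statistics stacked over components and diagonal per-component covariances assumed:

```python
import numpy as np

def ivector_posterior(N, F, T, Sigma):
    """Posterior of the hidden variable x for one utterance.
    N: (C,) zero-order stats; F: (C*D,) centered first-order stats stacked
    over the C components; T: (C*D, R) total-variability matrix; Sigma: (C*D,)
    diagonal residual covariance. Shapes are illustrative assumptions."""
    CD, R = T.shape
    D = CD // len(N)
    TS = T / Sigma[:, None]                   # Sigma^{-1} T (diagonal Sigma)
    L = np.eye(R)                             # L = I + sum_c N_c T_c^T Sigma_c^{-1} T_c
    for c, Nc in enumerate(N):
        Tc, TSc = T[c * D:(c + 1) * D], TS[c * D:(c + 1) * D]
        L += Nc * (Tc.T @ TSc)
    L_inv = np.linalg.inv(L)                  # posterior covariance L^{-1}
    phi = L_inv @ (TS.T @ F)                  # posterior mean L^{-1} T^T Sigma^{-1} F
    return phi, L_inv
```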
Step 4, parallel factor analyzer training --- the identity vectors i-vector of the l-th speaker are expressed as φi(l,1), …, φi(l,k), and the identity vectors x-vector as φx(l,1), …, φx(l,k), where k denotes the number of input utterances of that speaker; φi(l,k) denotes the identity vector i-vector of the k-th utterance of the l-th speaker, and φx(l,k) denotes the identity vector x-vector of the k-th utterance of the l-th speaker. The identity vector i-vector and the identity vector x-vector of the same speaker can be regarded as projections of the same latent vector, and can therefore be expressed as
φi(l,k) = μi + Fi·h(l) + εi(l,k)
φx(l,k) = μx + Fx·h(l) + εx(l,k)
where μi denotes the mean vector of the identity vector i-vector; μx denotes the mean vector of the identity vector x-vector; Fi denotes the projection matrix corresponding to i-vector; Fx denotes the projection matrix corresponding to x-vector; h(l) denotes the hidden variable of the l-th speaker; εi(l,k) denotes the residual vector of the identity vector i-vector of the k-th utterance of the l-th speaker, with εi ~ N(0, Σi), where Σi denotes the covariance matrix of the i-vector residual and N(0, Σi) denotes a normal distribution with mean 0 and covariance Σi; εx(l,k) denotes the residual vector of the identity vector x-vector of the k-th utterance of the l-th speaker, with εx ~ N(0, Σx), where Σx denotes the covariance matrix of the residual εx and N(0, Σx) denotes a normal distribution with mean 0 and covariance Σx. Through the EM algorithm, the parameters θ = {μi, Fi, Σi, μx, Fx, Σx} of the parallel factor analyzer are obtained.
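Analogously, a sketch of the E-step of the parallel factor analyzer, computing the posterior of the shared hidden variable h(l) from one speaker's K paired (i-vector, x-vector) observations; diagonal residual covariances are assumed for simplicity, and the M-step re-estimating θ = {μi, Fi, Σi, μx, Fx, Σx} follows the usual factor-analysis updates:

```python
import numpy as np

def pfa_posterior(Phi_i, Phi_x, mu_i, mu_x, F_i, F_x, Sigma_i, Sigma_x):
    """Posterior of h(l) given a speaker's paired identity vectors.
    Phi_i: (K, Di) i-vectors, Phi_x: (K, Dx) x-vectors; F_i: (Di, R),
    F_x: (Dx, R); Sigma_i: (Di,), Sigma_x: (Dx,) diagonal residual
    covariances (illustrative assumption)."""
    K = Phi_i.shape[0]
    F = np.vstack([F_i, F_x])                         # stacked projection [F_i; F_x]
    S_inv = np.concatenate([1.0 / Sigma_i, 1.0 / Sigma_x])
    centered = np.concatenate([Phi_i - mu_i, Phi_x - mu_x], axis=1)  # (K, Di+Dx)
    FS = F * S_inv[:, None]                           # Sigma^{-1} F
    L = np.eye(F.shape[1]) + K * (F.T @ FS)           # posterior precision of h(l)
    h_cov = np.linalg.inv(L)                          # posterior covariance
    h_mean = h_cov @ (FS.T @ centered.sum(axis=0))    # posterior mean
    return h_mean, h_cov
```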
Step 5, linear transform --- the parallel factor analyzer obtained in the training stage contains the parameters of both i-vector and x-vector; in actual online operation, only the parameters θx = {μx, Fx, Σx} corresponding to x-vector are needed. The identity vector xl-vector model after linear transformation is obtained on the basis of these parameters.
Step 6, identity vector xl-vector model establishment --- based on the parameters θx = {μx, Fx, Σx} corresponding to x-vector, the identity vector xl-vector after linear transformation is expressed as φxl = Lx⁻¹·Fxᵀ·Σx⁻¹·(φx − μx), where Lx⁻¹ denotes the posterior covariance of xl-vector and Lx = I + Fxᵀ·Σx⁻¹·Fx. This is further written in the form φxl = A·φx − b, where A = Lx⁻¹·Fxᵀ·Σx⁻¹ and b = A·μx are the linear parameters, so that the identity vector xl-vector is expressed as a linear transformation of x-vector.
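Since only A and b need to be stored for the test stage, the transform can be precomputed once from the x-vector-side parameters of the parallel factor analyzer; a minimal sketch, assuming a diagonal Σx for simplicity:

```python
import numpy as np

def xvector_linear_transform(mu_x, F_x, Sigma_x):
    """Precompute phi_xl = A @ phi_x - b from theta_x = {mu_x, F_x, Sigma_x}.
    F_x: (Dx, R) projection matrix; Sigma_x: (Dx,) diagonal residual
    covariance (the diagonal form is an illustrative assumption)."""
    FS = F_x * (1.0 / Sigma_x)[:, None]        # Sigma_x^{-1} F_x
    L = np.eye(F_x.shape[1]) + F_x.T @ FS      # L_x = I + F_x^T Sigma_x^{-1} F_x
    A = np.linalg.inv(L) @ FS.T                # A = L_x^{-1} F_x^T Sigma_x^{-1}
    b = A @ mu_x                               # b = A mu_x
    return A, b

# Test-time use: phi_xl = A @ phi_x - b; only A and b are kept in memory.
```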
Step 7, PLDA model training --- suppose the training speech consists of the speech of I speakers, each of whom has J different utterances of his or her own; define the j-th utterance of the i-th speaker as xij. Then, following the factor-analysis definition, the generative model of xij is: xij = μ + F·hi + G·wij + εij, where μ denotes the mean vector, F denotes the speaker information matrix, hi denotes the hidden variable of the i-th speaker, G denotes the channel information matrix, wij denotes the channel hidden variable of the j-th utterance of the i-th speaker, and εij denotes the residual of the j-th utterance of the i-th speaker. The parameter model of PLDA is updated using the EM algorithm.
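The generative side of this model can be sketched directly from the definition xij = μ + F·hi + G·wij + εij; the dimensions and the diagonal residual covariance below are illustrative assumptions:

```python
import numpy as np

def plda_sample(mu, F, G, Sigma, n_speakers=5, n_utts=3, seed=0):
    """Draw synthetic utterance vectors from x_ij = mu + F h_i + G w_ij + eps_ij.
    mu: (D,), F: (D, R1) speaker matrix, G: (D, R2) channel matrix,
    Sigma: (D,) diagonal residual covariance (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    data = []
    for i in range(n_speakers):
        h = rng.standard_normal(F.shape[1])        # speaker factor h_i, one per speaker
        for j in range(n_utts):
            w = rng.standard_normal(G.shape[1])    # channel factor w_ij, one per utterance
            eps = rng.standard_normal(mu.shape[0]) * np.sqrt(Sigma)
            data.append((i, mu + F @ h + G @ w + eps))
    return data  # list of (speaker label, utterance vector) pairs
```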
Step 8, speaker recognition in the test stage --- the enrollment speech and the corresponding speech to be identified undergo feature extraction and x-vector extraction; the x-vector is fed into the linear transform obtained in the training stage to obtain the new identity vector xl-vector, which is finally fed into the PLDA model obtained in the training stage to obtain the final result.
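A hypothetical end-to-end sketch of this test stage, chaining the sketches above; `frames` (preprocessed speech frames), the trained `XVectorTDNN` instance `xvec_net`, the stored pair (A, b), and a PLDA scorer `plda_score` are placeholders rather than components defined by the patent:

```python
import numpy as np
import torch

# Placeholders assumed from the sketches above: mfcc(), xvec_net (a trained
# XVectorTDNN), (A, b) from xvector_linear_transform(), and a hypothetical
# plda_score(params, enroll, test) returning a log-likelihood ratio.
feats = mfcc(frames)                                          # step 1: MFCC front end
with torch.no_grad():
    _, phi_x = xvec_net(torch.tensor(feats.T[None]).float())  # step 2: x-vector
phi_x = phi_x.numpy().ravel()
phi_xl = A @ phi_x - b                                        # step 6: stored linear transform
llr = plda_score(plda_params, phi_xl_enroll, phi_xl)          # step 8: PLDA scoring
```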
The method of the present invention is simulated and analyzed below.
Simulation verification of speaker recognition under the linearly transformed x-vector identity vector is carried out on the NIST SRE 2010 test set. The test set includes test tasks for 9 scenarios (common conditions, CC), covering interview, microphone, and telephone-channel data; the telephone-channel data additionally incorporate different speaker vocal efforts, mainly high vocal effort, normal vocal effort, and low vocal effort. The present invention uses the fifth scenario (CC'5), i.e., the scenario of different telephone channels under normal vocal effort. The equal error rate (Equal Error Rate, EER) and the detection cost function (Detection Cost Function, DCF) are used as evaluation criteria to measure the performance of the speaker recognition system.
Tests are carried out on the three task test sets coreext-coreext, core-10sec, and 10sec-10sec of NIST SRE 2010, where coreext and core refer to long-duration speech and 10sec refers to short-duration speech. The simulation uses the speech data of Switchboard 2, Switchboard Cellular, and NIST SRE 2004 to 2008 as training data. The experiments take the x-vector and i-vector systems as baseline models. The UBM is trained on male and female speech together. The x-vector model uses 20-dimensional MFCC features as acoustic features; the i-vector model uses the same 20-dimensional MFCC static feature parameters together with their first- and second-order differences, i.e., 60-dimensional features. For each speech segment, a 600-dimensional i-vector and a 512-dimensional x-vector are obtained. In the baseline systems the identity vectors are reduced to 400 dimensions by LDA, and then a full-covariance PLDA model is trained with a speaker-space rank of 200 dimensions and a channel-space rank of 0 dimensions. The xl-vector proposed by the present invention already takes into account, in its design, maximizing the between-class distance and minimizing the within-class distance of the speaker variables, so the LDA step is not used.
Table 1 compares different systems under the EER and DCF evaluation criteria on the three tasks coreext-coreext, core-10sec, and 10sec-10sec; the dimension of xl-vector is 512. Here i-vector and x-vector are the two baseline systems, and the fusion system is the system in which the PLDA scores of i-vector and x-vector are added. On the three tasks coreext-coreext, core-10sec, and 10sec-10sec, the proposed xl-vector is better than the two baseline systems under the EER criterion; under the DCF criterion it is slightly lower than the x-vector system on the 10sec-10sec task and better than the two baseline systems on the other two tasks. Compared with the fusion system, the xl-vector system has a clear EER advantage on the coreext-coreext task; the memory and computing speed required by xl-vector are similar to those of x-vector, whereas the fusion system has to consider both x-vector and i-vector and therefore needs more memory and runs more slowly. In summary, the proposed xl-vector has clear advantages over both the two baseline systems and the fusion system.
Table 1
Table 2 compares the EER and DCF evaluation criteria of the new identity vector xl-vector under different dimensions on the three tasks coreext-coreext, core-10sec, and 10sec-10sec. It can be seen that, on the coreext-coreext task, EER performance improves as the dimension increases, reaching its best value at dimension 500 and essentially keeping that value at dimension 512, while DCF performance remains essentially unchanged. On the core-10sec and 10sec-10sec tasks, EER performance worsens as the dimension increases and is best at dimension 200; the variation of DCF stays within 10%. In summary, when the test utterances are long, higher dimensions perform better; when the test utterances are short, lower dimensions perform better.
Table 2
It can be seen that the xl-vector model proposed by the inventors obtains, in the training stage, a linear transformation for x-vector through the parallel factor analysis of x-vector and i-vector, improves the performance of the speaker recognition system, and retains the advantage that memory requirements and computing speed are unaffected.

Claims (4)

1. A speaker recognition method under identity vector x-vector linear transformation, characterized by comprising the following steps:
Step 1, extract mel-frequency cepstral coefficients from the speaker's training speech as the speaker's features;
Step 2, using the features obtained in step 1, train an x-vector model with a deep neural network structure, establishing the identity vector x-vector model, so as to obtain the identity vector x-vector;
Step 3, using the features obtained in step 1, train an i-vector model based on the EM algorithm, establishing the identity vector i-vector model, so as to obtain the identity vector i-vector;
Step 4, regard the i-vector and x-vector of the same speaker as projections of the same vector, and train the parameters of a parallel factor analyzer based on the EM algorithm, thereby completing the training of the parallel factor analyzer;
Step 5, by means of a linear transform, retain the parameters corresponding to x-vector among the parameters of the parallel factor analyzer, and on the basis of this linear transform express the identity vector xl-vector as a linear transformation of x-vector, thereby establishing the identity vector xl-vector model and obtaining the identity vector xl-vector;
Step 6, using the identity vector xl-vector, update the parameter model of PLDA with the EM algorithm, completing the training of the PLDA model;
Step 7, speaker recognition in the test stage:
after feature extraction, the enrollment speech and the corresponding speech to be identified are passed through the identity vector x-vector model to obtain the identity vector x-vector; the identity vector x-vector is fed into the trained linear transform to obtain the new identity vector xl-vector; finally the identity vector xl-vector is input to the trained PLDA model to obtain the speaker recognition result.
2. The speaker recognition method under identity vector x-vector linear transformation according to claim 1, characterized in that: in step 4, considering that different identity vectors may map to the same vector space, the method of parallel factor analysis is used to obtain this common vector.
3. The speaker recognition method under identity vector x-vector linear transformation according to claim 1, characterized in that: in step 4, the identity vectors i-vector of the l-th speaker are expressed as φi(l,1), …, φi(l,k), and the identity vectors x-vector as φx(l,1), …, φx(l,k), where k denotes the number of input utterances of that speaker; φi(l,k) denotes the identity vector i-vector of the k-th utterance of the l-th speaker, and φx(l,k) denotes the identity vector x-vector of the k-th utterance of the l-th speaker; the identity vector i-vector and the identity vector x-vector of the same speaker can be regarded as projections of the same latent vector and can therefore be expressed as
φi(l,k) = μi + Fi·h(l) + εi(l,k)
φx(l,k) = μx + Fx·h(l) + εx(l,k)
where μi denotes the mean vector of the identity vector i-vector; μx denotes the mean vector of the identity vector x-vector; Fi denotes the projection matrix corresponding to i-vector; Fx denotes the projection matrix corresponding to x-vector; h(l) denotes the hidden variable of the l-th speaker; εi(l,k) denotes the residual vector of the identity vector i-vector of the k-th utterance of the l-th speaker, with εi ~ N(0, Σi), Σi denoting the covariance matrix of the i-vector residual and N(0, Σi) denoting a normal distribution with mean 0 and covariance Σi; εx(l,k) denotes the residual vector of the identity vector x-vector of the k-th utterance of the l-th speaker, with εx ~ N(0, Σx), Σx denoting the covariance matrix of the residual εx and N(0, Σx) denoting a normal distribution with mean 0 and covariance Σx; through the EM algorithm, the parameters θ = {μi, Fi, Σi, μx, Fx, Σx} of the parallel factor analyzer are obtained.
4. The speaker recognition method under identity vector x-vector linear transformation according to claim 1, characterized in that: in step 6, based on the parameters θx = {μx, Fx, Σx} corresponding to x-vector, the identity vector xl-vector after linear transformation is expressed as φxl = Lx⁻¹·Fxᵀ·Σx⁻¹·(φx − μx), where Lx⁻¹ denotes the posterior covariance of xl-vector and Lx = I + Fxᵀ·Σx⁻¹·Fx; this is further written in the form φxl = A·φx − b, where A = Lx⁻¹·Fxᵀ·Σx⁻¹ and b = A·μx are the linear parameters, so that the identity vector xl-vector is expressed as a linear transformation of x-vector.
CN201910312097.2A 2019-04-18 2019-04-18 Speaker identification method under identity vector x-vector linear transformation Active CN110047504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910312097.2A CN110047504B (en) 2019-04-18 2019-04-18 Speaker identification method under identity vector x-vector linear transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910312097.2A CN110047504B (en) 2019-04-18 2019-04-18 Speaker identification method under identity vector x-vector linear transformation

Publications (2)

Publication Number Publication Date
CN110047504A true CN110047504A (en) 2019-07-23
CN110047504B CN110047504B (en) 2021-08-20

Family

ID=67277768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910312097.2A Active CN110047504B (en) 2019-04-18 2019-04-18 Speaker identification method under identity vector x-vector linear transformation

Country Status (1)

Country Link
CN (1) CN110047504B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9685159B2 (en) * 2009-11-12 2017-06-20 Agnitio Sl Speaker recognition from telephone calls
US9792823B2 (en) * 2014-09-15 2017-10-17 Raytheon Bbn Technologies Corp. Multi-view learning in detection of psychological states
CN105139857A (en) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countercheck method for automatically identifying speaker aiming to voice deception
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN107274905A (en) * 2016-04-08 2017-10-20 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and system
CN108922556A (en) * 2018-07-16 2018-11-30 百度在线网络技术(北京)有限公司 sound processing method, device and equipment
CN109346084A (en) * 2018-09-19 2019-02-15 湖北工业大学 Method for distinguishing speek person based on depth storehouse autoencoder network
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN109801634A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of fusion method and device of vocal print feature

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DEHAK N., KENNY P. J., DEHAK R., et al.: "Front-end factor analysis for speaker verification", IEEE Transactions on Audio, Speech, and Language Processing *
LONGTING XU, BO REN, GUANGLIN ZHANG, JICHEN YANG: "Linear transformation on x-vector for text-independent speaker verification", Electronics Letters *
SAON G., SOLTAU H., NAHAMOO D., et al.: "Speaker adaptation of neural network acoustic models using i-vectors", 2013 IEEE Workshop on Automatic Speech Recognition and Understanding *
XU L., DAS R. K., YILMAZ E., et al.: "Generative x-vectors for text-independent speaker verification", 2018 IEEE Spoken Language Technology Workshop (SLT) *
XU L., LEE K. A., LI H., et al.: "Generalizing i-vector estimation for rapid speaker recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
徐珑婷: "Research on speaker recognition technology based on sparse decomposition", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081256A (en) * 2019-12-31 2020-04-28 苏州思必驰信息科技有限公司 Digital string voiceprint password verification method and system
CN111462759A (en) * 2020-04-01 2020-07-28 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111462759B (en) * 2020-04-01 2024-02-13 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
WO2021174883A1 (en) * 2020-09-22 2021-09-10 平安科技(深圳)有限公司 Voiceprint identity-verification model training method, apparatus, medium, and electronic device
CN113689861A (en) * 2021-08-10 2021-11-23 上海淇玥信息技术有限公司 Intelligent track splitting method, device and system for single sound track call recording
CN113689861B (en) * 2021-08-10 2024-02-27 上海淇玥信息技术有限公司 Intelligent track dividing method, device and system for mono call recording
CN114974259A (en) * 2021-12-23 2022-08-30 号百信息服务有限公司 Voiceprint recognition method
CN115273863A (en) * 2022-06-13 2022-11-01 广东职业技术学院 Compound network class attendance system and method based on voice recognition and face recognition

Also Published As

Publication number Publication date
CN110047504B (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN110047504A (en) Method for distinguishing speek person under identity vector x-vector linear transformation
Chauhan et al. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database
CN102509547B Method and system for voiceprint recognition based on vector quantization
CN102800316B Optimal codebook design method for voiceprint recognition system based on neural network
CN109671442A Many-to-many voice conversion method based on STARGAN and x vector
CN109119072A Civil aviation air-ground communication acoustic model construction method based on DNN-HMM
Kekre et al. Speaker identification by using vector quantization
US20140236593A1 (en) Speaker recognition method through emotional model synthesis based on neighbors preserving principle
Irum et al. Speaker verification using deep neural networks: A
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN109599091A (en) Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN109346084A (en) Method for distinguishing speek person based on depth storehouse autoencoder network
CN107358947A (en) Speaker recognition methods and system again
CN108986798A (en) Processing method, device and the equipment of voice data
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
Kheder et al. A unified joint model to deal with nuisance variabilities in the i-vector space
Hong et al. Combining deep embeddings of acoustic and articulatory features for speaker identification
Ng et al. Teacher-student training for text-independent speaker recognition
Wang et al. Robust speaker identification of iot based on stacked sparse denoising auto-encoders
Sekkate et al. Speaker identification for OFDM-based aeronautical communication system
Lin et al. Mixture representation learning for deep speaker embedding
CN117095669A (en) Emotion voice synthesis method, system, equipment and medium based on variation automatic coding
Monteiro et al. On the performance of time-pooling strategies for end-to-end spoken language identification
Koolagudi et al. Speaker recognition in the case of emotional environment using transformation of speech features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant