CN101315771A - Compensation method for different speech coding influence in speaker recognition - Google Patents

Compensation method for different speech coding influence in speaker recognition

Info

Publication number
CN101315771A
CN101315771A CNA2008100646691A CN200810064669A
Authority
CN
China
Prior art keywords
map
speaker
sequence
lambda
deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100646691A
Other languages
Chinese (zh)
Inventor
韩纪庆
李雪林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CNA2008100646691A priority Critical patent/CN101315771A/en
Publication of CN101315771A publication Critical patent/CN101315771A/en
Pending legal-status Critical Current

Abstract

The invention relates to a method for compensating for the influence of different speech codings in speaker recognition, in particular to a method for compensating for speech-coding mismatch in speaker recognition over the Internet, so as to solve the degradation of speaker-recognition performance caused by a mismatch between the coding of the training speech and that of the test speech. The method performs feature processing on the speakers' speech signals under a standard coding scheme and takes the speaker models under the standard coding, obtained by expectation-maximization training, as the matching object library; the speech signal of the speaker to be recognized is input and feature extraction is applied to obtain a feature vector sequence; the first T frames of the feature sequence are selected and a MAP algorithm is run on them to adaptively estimate the deviation between the current coding and the standard coding; the original feature sequence is adjusted and compensated with the estimated deviation to obtain a new feature vector sequence; the new feature vector sequence is matched against each speaker model under the standard coding, and a decision yields the recognition result.

Description

Compensation method for the influence of different speech codings in speaker recognition
Technical field
The present invention relates to a compensation method in the field of speaker recognition technology, and specifically to a method for compensating for speech-coding mismatch in speaker recognition over the Internet.
Background technology
Speaker recognition refers to analyzing and processing a speaker's speech signal to automatically confirm whether the speaker belongs to a recorded set of speakers, and further to determine who the speaker is. Although speaker recognition systems have achieved fairly good results under clean laboratory speech conditions, in real-world applications their performance is constrained by many factors and the recognition results can be unsatisfactory. One of the main factors affecting performance is the mismatch between the codings of the training and test speech signals, which can arise for a variety of reasons. With the development of modern network technology, more and more applications transmit speech signals over the Internet. Speech transmitted over the network mostly uses medium- or low-rate speech or audio coding with relatively high compression ratios. Although low-rate speech (audio) compression coding is convenient for channel transmission and saves storage space, most speech (audio) codings are lossy, so speech quality inevitably suffers; moreover, different coding schemes use different coding mechanisms, especially in the case of streaming-media codings. Speech signals produced under different coding schemes therefore mismatch in their characteristic parameters. When performing speaker recognition over the network, the available training data are often signals under one speech (audio) coding, while in actual use the test speech is a signal under another coding; speaker recognition then faces a mismatch between training and test speech caused by the differing codings, which degrades recognition performance. For this reason, an effective compensation method for the influence of different speech codings needs to be developed.
Summary of the invention
The present invention solves the degradation of speaker-recognition performance caused by a mismatch between the codings of the training and test speech by providing a compensation method for the influence of different speech codings in speaker recognition. The invention is realized by the following steps:
Step 1: adopt a certain coding scheme as the standard coding scheme; perform feature processing on the speech signals of N speakers under the standard coding and train with the expectation-maximization algorithm to obtain N speaker models {λ_n}, n = 1, …, N, under the standard coding as the matching object library, where N is a natural number;
Step 2: input the speech signal s(n) of the speaker to be recognized, and perform feature extraction on the input signal to obtain the feature vector sequence X = {x_1, x_2, …, x_S}, where S is a natural number;
Step 3: select the first T frames of the feature vector sequence X to obtain the sequence X_T = {x_1, x_2, …, x_T}, and from this T-frame sequence X_T adaptively obtain, with the MAP algorithm, the deviation h_MAP between the current coding and the standard coding, where T is a natural number;
Step 4: use the obtained deviation h_MAP between the current coding and the standard coding to adjust and compensate the feature vector sequence X, obtaining the new feature vector sequence X = {x_1 − h_MAP, x_2 − h_MAP, …, x_S − h_MAP};
Step 5: match the new feature vector sequence X against each of the N speaker models {λ_n}, n = 1, …, N, under the standard coding, and make a decision to obtain the recognition result.
Beneficial effects: the present invention adjusts the features under the coding used at recognition time so that they approach the speech features in the matching object library, and uses a Gaussian distribution to estimate the coding deviation, reducing the distortion of the speaker's speech features caused by coding and thus the loss of recognition rate caused by coding mismatch; the average recognition rate of the system under coding mismatch is improved by 7.1%.
Description of drawings
Fig. 1 shows the variation of the system recognition rate as the adjustment factor α takes values from 0 to 0.9. Fig. 2 shows the variation of the system recognition rate under coding compensation for the baseline system and for the maximum a posteriori probability algorithm, respectively, where one curve is the recognition rate obtained with the MAP algorithm and the other is the recognition rate obtained with the baseline system.
Embodiment
Embodiment 1: referring to Fig. 1 and Fig. 2, this embodiment consists of the following steps (a minimal end-to-end sketch is given after the list):
Step 1: adopt one of the coding schemes, such as MP3 coding, RM coding, or WMA coding, as the standard coding scheme; perform feature processing on the speech signals of N speakers under the standard coding and train with the expectation-maximization algorithm to obtain N speaker Gaussian mixture models {λ_n}, n = 1, …, N, under the standard coding as the matching object library, where N is a natural number;
Step 2: input the speech signal s(n) of the speaker to be recognized, and perform feature extraction on the input signal to obtain the feature vector sequence X = {x_1, x_2, …, x_S}, where S is a natural number;
Step 3: select the first T frames of the feature vector sequence X to obtain the sequence X_T = {x_1, x_2, …, x_T}, and from this T-frame sequence X_T adaptively obtain, with the MAP algorithm, the deviation h_MAP between the current coding and the standard coding, where T is a natural number;
Step 4: use the obtained deviation h_MAP between the current coding and the standard coding to adjust and compensate the feature vector sequence X, obtaining the new feature vector sequence X = {x_1 − h_MAP, x_2 − h_MAP, …, x_S − h_MAP};
Step 5: match the new feature vector sequence X against each of the N speaker models {λ_n}, n = 1, …, N, under the standard coding, and make a decision to obtain the recognition result.
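For concreteness, the following is a minimal, runnable sketch of this five-step flow on synthetic data. The use of scikit-learn's GaussianMixture as the speaker models, the synthetic features, the choice of reference model, and the use of the simple initialization h_0 of formula (10) in place of the full MAP iteration of formula (6) (both sketched further below) are all illustrative assumptions, not the patent's reference implementation.

```python
# Assumed sketch: sklearn GMMs as speaker models, synthetic LPCC-like
# features, and the h0 estimate of formula (10) standing in for the
# full MAP iteration of formula (6).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
L, M, T = 12, 8, 50                       # feature dim, mixture count, adaptation frames

# Step 1: one diagonal-covariance GMM per speaker, trained on standard-coded features.
gmms = []
for n in range(3):                        # N = 3 speakers in this sketch
    feats = rng.normal(loc=float(n), scale=1.0, size=(500, L))
    gmms.append(GaussianMixture(n_components=M, covariance_type="diag",
                                random_state=0).fit(feats))

# Step 2: test features of speaker 1, shifted by a simulated coding deviation h.
h_true = 0.4 * np.ones(L)
X = rng.normal(loc=1.0, scale=1.0, size=(200, L)) + h_true

# Step 3 (stand-in for the MAP estimate): h0 of formula (10), the
# mixture-weighted average difference between the first T frames and the means.
g = gmms[1]                               # assumed reference speaker model for adaptation
diff = X[:T, None, :] - g.means_[None, :, :]              # (T, M, L)
h0 = (g.weights_[None, :, None] * diff).sum(axis=1).mean(axis=0)

# Step 4: compensate every frame; Step 5: score against each model and decide.
X_comp = X - h0
scores = [gm.score(X_comp) for gm in gmms]  # mean log-likelihood per model
print("decision:", int(np.argmax(scores)))  # expected: speaker index 1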
The feature extraction in step 2 of this embodiment proceeds as follows: the speaker signal s(n) is sampled, quantized, and pre-emphasized; assuming the speaker signal is short-time stationary, it is divided into frames, concretely by weighting with a movable finite-length window; linear predictive coding (LPC) coefficients are computed for the weighted speech signal s_w(n), and the feature vector sequence X = {x_1, x_2, …, x_S} is then obtained from the relation between the LPC and the linear prediction cepstral coefficients (LPCC), which is as follows (a small conversion routine is sketched after the formula):
c_{LP}(1) = a_1
c_{LP}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, a_{n-k}\, c_{LP}(k), \quad 1 < n \le p
c_{LP}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n}\, a_{n-k}\, c_{LP}(k), \quad n > p
where c_{LP}(n) is the n-th dimensional component of the LPCC, a_n is the n-th dimensional component of the LPC, p is the LPC order, and n is a natural number.
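As a sketch, the recursion can be implemented directly; the function name and the handling of the text's 1-based indices are assumptions for illustration.

```python
import numpy as np

def lpc_to_lpcc(a, q):
    """Convert p LPC coefficients to q LPCC coefficients via the recursion above.

    a : sequence of LPC coefficients, with a[0] holding a_1 in the text's 1-based indexing.
    q : number of cepstral coefficients to produce.
    """
    p = len(a)
    c = np.zeros(q + 1)                        # c[n] stores c_LP(n); index 0 unused
    for n in range(1, q + 1):
        acc = a[n - 1] if n <= p else 0.0      # the a_n term exists only for n <= p
        for k in range(max(1, n - p), n):      # k range keeps a_{n-k} within 1..p
            acc += (k / n) * a[n - k - 1] * c[k]
        c[n] = acc
    return c[1:]

# e.g. lpc_to_lpcc(np.array([0.9, -0.3, 0.1]), 12) yields a 12-dimensional LPCC vector
```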
The computation in steps 3 and 4 is as follows: assume there is a coding deviation h between the coding of the test speech and the coding of the training speech; this deviation h can be represented by a single Gaussian distribution N(μ_h, Σ_h) with mean μ_h and covariance matrix Σ_h. According to the MAP estimation criterion, the MAP estimate h_MAP is:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, p(h \mid X, \lambda) \,\} \qquad (1)
where λ is the reference speaker model and X represents the chosen first-T-frame sequence X_T.
By Bayes' formula and the monotonicity of the logarithm, formula (1) is equivalent to:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \log p(X \mid h, \lambda) + \log p(h) \,\} \qquad (2)
where p(h) is the prior of the coding deviation h.
To control the proportion of the prior of the coding deviation h when the amount of adaptation data differs, an adjustment factor α is added to formula (2), giving:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \alpha \log p(X \mid h, \lambda) + (1-\alpha) \log p(h) \,\} \qquad (3)
where p(X | h, λ) takes the Gaussian mixture form, that is:
p(X \mid h, \lambda) = \sum_{i=1}^{M} p(X, i \mid h, \lambda) = \sum_{i=1}^{M} c_i\, p_i(X \mid h, \lambda) \qquad (4)
Here M is 64, i indexes the i-th mixture component, and c_i is the weight of each mixture component.
Formula (3) is solved on the T-frame adaptation data set with the expectation-maximization algorithm to estimate the current coding deviation. After a series of transformations of the auxiliary function Q over the hidden state sequence of the Gaussian mixture model, the resulting function is:
Q(h, \bar{h}) = \alpha \sum_{t=1}^{T} \sum_{i=1}^{M} \frac{p(x_t, i \mid h, \lambda)}{p(x_t \mid h, \lambda)} \log p(x_t, i \mid \bar{h}, \lambda) + (1-\alpha)\, T \log p(\bar{h}) \qquad (5)
where h is the previous iteration result and h̄ is the current iteration result; x_t is the speech feature of the t-th frame; p(x_t, i | h, λ) is the probability of the t-th frame, after adjustment by the deviation h, on the i-th mixture component of model λ; p(x_t | h, λ) is the probability of the t-th frame, after adjustment by the deviation h, on all mixture components of model λ; p(h̄) is the prior of the coding deviation evaluated at h̄.
Assume that the covariance matrix Σ_h of the coding deviation h is diagonal; then setting ∂Q/∂h̄_j = 0 yields
\bar{h}_j = \frac{\alpha \sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{x_{tj} - \mu_{ij}}{\sigma_{ij}^2} \right] + (1-\alpha)\, \frac{T \mu_{hj}}{\sigma_{hj}^2}}{\alpha \sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{1}{\sigma_{ij}^2} \right] + (1-\alpha)\, \frac{T}{\sigma_{hj}^2}} \qquad (6)
where h̄_j is the j-th dimensional value of the current iterate h̄, j = 1, 2, …, L, with L the dimension of the feature vector; x_{tj} is the j-th dimensional value of the t-th frame feature vector of the test speech; μ_{ij} and σ_{ij}² are the j-th mean and j-th variance of the i-th mixture component of the speaker model under the standard coding; μ_{hj} and σ_{hj}² are, respectively, the j-th dimensional value of the coding-deviation mean μ_h and the j-th value of the covariance matrix Σ_h.
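A minimal NumPy sketch of the fixed-point iteration of formula (6) follows, assuming the reference GMM is given as weight, mean, and diagonal-variance arrays; the function name, iteration count, and zero initialization are illustrative assumptions (the patent initializes with h_0 of formula (10) below).

```python
import numpy as np

def map_deviation(X_T, c, mu, var, mu_h, var_h, alpha=0.5, iters=10):
    """Iterate formula (6): MAP estimate of the coding deviation h.

    X_T   : (T, L) adaptation frames (the first T frames of the test speech).
    c     : (M,)   mixture weights c_i of the reference GMM lambda.
    mu    : (M, L) component means mu_i.
    var   : (M, L) diagonal component variances sigma_i^2.
    mu_h  : (L,)   prior mean of the deviation h.
    var_h : (L,)   prior (diagonal) variance of the deviation h.
    """
    T = X_T.shape[0]
    h = np.zeros(X_T.shape[1])                # illustrative initialization
    for _ in range(iters):
        Xc = X_T - h                          # frames adjusted by the current h
        # log of c_i N(x_t - h; mu_i, diag var_i) for every (t, i)
        log_ci = (np.log(c)[None, :]
                  - 0.5 * np.log(2 * np.pi * var).sum(axis=1)[None, :]
                  - 0.5 * (((Xc[:, None, :] - mu[None, :, :]) ** 2)
                           / var[None, :, :]).sum(axis=2))
        # component posteriors c_i p_i(x_t|h) / p(x_t|h), shape (T, M)
        gamma = np.exp(log_ci - np.logaddexp.reduce(log_ci, axis=1, keepdims=True))
        # numerator and denominator of formula (6), dimension by dimension
        resid = (X_T[:, None, :] - mu[None, :, :]) / var[None, :, :]   # (T, M, L)
        num = alpha * np.einsum('tm,tml->l', gamma, resid) \
              + (1 - alpha) * T * mu_h / var_h
        den = alpha * (gamma[:, :, None] / var[None, :, :]).sum(axis=(0, 1)) \
              + (1 - alpha) * T / var_h
        h = num / den
    return h
```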
In the above estimation formula for the coding deviation, the prior parameters μ_{hj} and σ_{hj}² of the coding deviation are unknown quantities, so before MAP estimation the prior of the coding deviation h must first be obtained.
To obtain the prior of the coding deviation h, set the factor α in formula (6) to 1; the maximum a posteriori estimate then reduces to the maximum likelihood estimate, with the corresponding iterative formula:
\bar{h}_j = \frac{\sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{x_{tj} - \mu_{ij}}{\sigma_{ij}^2} \right]}{\sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{1}{\sigma_{ij}^2} \right]} \qquad (7)
If there are H classes of coding, the estimates of the H class coding deviations can be obtained from formula (7), denoted {h̄_{M1}, h̄_{M2}, …, h̄_{MH}}; finally, the values of μ_h and Σ_h are estimated with formulas (8) and (9).
\mu_h = \frac{1}{H} \sum_{k=1}^{H} \bar{h}_{Mk} \qquad (8)
\Sigma_h = \frac{1}{H} \sum_{k=1}^{H} (\bar{h}_{Mk} - \mu_h)^2 \qquad (9)
Formula (7) raises the problem of setting the initial value of the coding deviation h; here the accumulated difference between the mean of the speech under the current non-standard coding and the reference speaker model under the standard coding is taken as the initial value h_0 of h, as shown in the following formula, where c_i is the weight of the i-th mixture component of the reference speaker model GMM:
h_0 = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{M} \left[ c_i \, (x_t - \mu_i) \right] \qquad (10)
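As a sketch, the initialization of formula (10) and the prior estimation of formulas (8) and (9) can be written as follows; the diagonal-variance convention for Σ_h and the function names are assumptions for illustration, and the per-codec ML estimates fed to deviation_prior would come from formula (7), e.g. the map_deviation sketch above with alpha = 1 (the prior terms then vanish).

```python
import numpy as np

def initial_deviation(X_T, c, mu):
    """Formula (10): h0 as the mixture-weighted mean difference between
    the T adaptation frames and the reference GMM component means."""
    diff = X_T[:, None, :] - mu[None, :, :]                 # (T, M, L)
    return (c[None, :, None] * diff).sum(axis=1).mean(axis=0)

def deviation_prior(h_ml_list):
    """Formulas (8) and (9): prior mean and (diagonal) variance of the
    coding deviation from H per-codec ML estimates of formula (7)."""
    H = np.stack(h_ml_list)                                 # (H, L)
    mu_h = H.mean(axis=0)
    var_h = ((H - mu_h) ** 2).mean(axis=0)
    return mu_h, var_h
```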
With the estimate of the deviation h in hand, the original feature space of the current coding can be compensated and mapped to the feature space of the standard coding; the concrete compensation strategy is:
X = X - h_{MAP} \qquad (11)
The matching and decision process in step 5 is as follows: for the feature vector sequence X, where X now denotes the new, compensated feature vector sequence of formula (11), the posterior probability of the n-th speaker is:
p(\lambda_n \mid X) = \frac{p(X \mid \lambda_n)\, p(\lambda_n)}{p(X)} = \frac{p(X \mid \lambda_n)\, p(\lambda_n)}{\sum_{m=1}^{N} p(X \mid \lambda_m)\, p(\lambda_m)} \qquad (12)
where p(λ_n) is the prior probability that the n-th person speaks; p(X) is the probability density of the feature vector sequence X under the condition of the N speakers in the matching object library; and p(X | λ_n) is the class-conditional probability that the n-th person produces the feature vector sequence X. The maximum a posteriori criterion for the recognition result is:
n^{*} = \arg\max_{1 \le n \le N} p(\lambda_n \mid X) \qquad (13)
where n* denotes the recognition decision. Assuming that each person's prior probability of speaking is equal, we obtain:
p(\lambda_n) = \frac{1}{N}, \quad n = 1, 2, \ldots, N \qquad (14)
Moreover, p(X) in formula (12) is the same for every speaker. Thus, formula (13) can be written as
n^{*} = \arg\max_{1 \le n \le N} p(X \mid \lambda_n) \qquad (15)
At this point, the maximum a posteriori criterion reduces to the maximum likelihood criterion.
To simplify the computation, the log-likelihood function is usually adopted, and the decision is:
n^{*} = \arg\max_{1 \le n \le N} \ln p(X \mid \lambda_n) \qquad (16)
Formula (16) is the closed-set test decision rule. Only the closed-set test is discussed here, avoiding the influence of open-set test thresholds on the recognition rate, highlighting the effect of coding mismatch, and reducing the complexity of the problem. A sketch of the decision rule follows.
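A minimal sketch of formulas (12)-(16), assuming each speaker model is a diagonal-covariance GMM given as (weights, means, variances) arrays; the representation and function names are illustrative.

```python
import numpy as np

def decide(X_comp, gmm_params):
    """Closed-set decision of formula (16): with equal priors p(lambda_n) = 1/N
    and a common p(X), pick the model with the highest log-likelihood.

    X_comp     : (S, L) compensated feature vector sequence.
    gmm_params : list of (c, mu, var) per speaker; c (M,), mu (M, L), var (M, L).
    """
    def loglik(X, c, mu, var):
        # ln p(X | lambda) = sum_t ln sum_i c_i N(x_t; mu_i, diag var_i)
        log_ci = (np.log(c)[None, :]
                  - 0.5 * np.log(2 * np.pi * var).sum(axis=1)[None, :]
                  - 0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2)
                           / var[None, :, :]).sum(axis=2))
        return np.logaddexp.reduce(log_ci, axis=1).sum()

    scores = [loglik(X_comp, *p) for p in gmm_params]
    return int(np.argmax(scores))                           # n* of formula (16)
```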

Claims (2)

1. A compensation method for the influence of different speech codings in speaker recognition, characterized in that it is realized by the following steps:
Step 1: adopt a certain coding scheme as the standard coding scheme; perform feature processing on the speech signals of N speakers under the standard coding and train with the expectation-maximization algorithm to obtain N speaker models {λ_n}, n = 1, …, N, under the standard coding as the matching object library, where N is a natural number;
Step 2: input the speech signal s(n) of the speaker to be recognized, and perform feature extraction on the input signal to obtain the feature vector sequence X = {x_1, x_2, …, x_S}, where S is a natural number;
Step 3: select the first T frames of the feature sequence X to obtain the sequence X_T = {x_1, x_2, …, x_T}, and from this T-frame sequence X_T adaptively obtain, with the MAP algorithm, the deviation h_MAP between the current coding and the standard coding, where T is a natural number;
Step 4: use the obtained deviation h_MAP between the current coding and the standard coding to adjust and compensate the feature sequence, obtaining the new feature vector sequence X = {x_1 − h_MAP, x_2 − h_MAP, …, x_S − h_MAP};
Step 5: match the new feature vector sequence X against each of the N speaker models {λ_n}, n = 1, …, N, under the standard coding, and make a decision to obtain the recognition result.
2. The compensation method for the influence of different speech codings in speaker recognition according to claim 1, characterized in that, in the MAP algorithm described in step 3, the MAP estimate h_MAP is:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, p(h \mid X, \lambda) \,\} \qquad (1)
where λ is the reference speaker model and X represents the chosen first-T-frame sequence X_T.
By Bayes' formula and the monotonicity of the logarithm, formula (1) is equivalent to:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \log p(X \mid h, \lambda) + \log p(h) \,\} \qquad (2)
where p(h) is the prior of the coding deviation h;
To control the proportion of the prior of the coding deviation h when the amount of adaptation data differs, an adjustment factor α is added to formula (2), giving:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \alpha \log p(X \mid h, \lambda) + (1-\alpha) \log p(h) \,\} \qquad (3)
where p(X | h, λ) takes the Gaussian mixture form, that is:
p(X \mid h, \lambda) = \sum_{i=1}^{M} p(X, i \mid h, \lambda) = \sum_{i=1}^{M} c_i\, p_i(X \mid h, \lambda) \qquad (4)
where i indexes the i-th mixture component and c_i is the weight of each mixture component.
CNA2008100646691A 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition Pending CN101315771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100646691A CN101315771A (en) 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100646691A CN101315771A (en) 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition

Publications (1)

Publication Number Publication Date
CN101315771A true CN101315771A (en) 2008-12-03

Family

ID=40106754

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100646691A Pending CN101315771A (en) 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition

Country Status (1)

Country Link
CN (1) CN101315771A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024455B (en) * 2009-09-10 2014-09-17 索尼株式会社 Speaker recognition system and method
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20081203