CN101315771A - Compensation method for different speech coding influence in speaker recognition - Google Patents

Compensation method for different speech coding influence in speaker recognition

Info

Publication number
CN101315771A
CN101315771A CNA2008100646691A CN200810064669A
Authority
CN
China
Prior art keywords
map
speaker
sequence
lambda
deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100646691A
Other languages
Chinese (zh)
Inventor
韩纪庆
李雪林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CNA2008100646691A priority Critical patent/CN101315771A/en
Publication of CN101315771A publication Critical patent/CN101315771A/en
Pending legal-status Critical Current

Abstract

The invention relates to a method for compensating for the influence of different speech codings in speaker recognition, in particular to a method for compensating for speech-coding mismatch in speaker recognition over the Internet, so as to solve the degradation of speaker-recognition performance caused by a mismatch between the coding of the training speech and that of the test speech. The method performs feature processing on the speakers' speech signals under a standard coding scheme and takes the speaker models under the standard coding, obtained by expectation-maximization training, as the matching object library; the speech signal of the speaker to be recognized is input and feature extraction is applied to obtain a feature vector sequence; the first T frames of the feature sequence are selected and a MAP algorithm is run on them to adaptively estimate the deviation between the current coding and the standard coding; the original feature sequence is adjusted and compensated with the estimated deviation to obtain a new feature vector sequence; the new feature vector sequence is matched against each speaker model under the standard coding, and a decision yields the recognition result.

Description

Compensation method for the influence of different speech codings in speaker recognition
Technical field
The present invention relates to a compensation method in the field of speaker recognition technology, and specifically to a method for compensating for speech-coding mismatch in speaker recognition over the Internet.
Background technology
Speaker recognition refers to analyzing and processing a speaker's speech signal to automatically confirm whether the speaker belongs to a recorded set of speakers, and further to determine who the speaker is. Although speaker recognition systems have achieved fairly good results under clean laboratory speech conditions, in real-world applications their performance is constrained by many factors and the recognition results can be unsatisfactory. One of the main factors affecting performance is the mismatch between the codings of the training and test speech signals, which can arise for a variety of reasons. With the development of modern network technology, more and more applications transmit speech signals over the Internet. Speech transmitted over the network mostly uses medium- or low-rate speech or audio coding with relatively high compression ratios. Although low-rate speech (audio) compression coding is convenient for channel transmission and saves storage space, most speech (audio) codings are lossy, so speech quality inevitably suffers; moreover, different coding schemes use different coding mechanisms, especially in the case of streaming-media codings. Speech signals produced under different coding schemes therefore mismatch in their characteristic parameters. When performing speaker recognition over the network, the available training data are often signals under one speech (audio) coding, while in actual use the test speech is a signal under another coding; speaker recognition then faces a mismatch between training and test speech caused by the differing codings, which degrades recognition performance. For this reason, an effective compensation method for the influence of different speech codings needs to be developed.
Summary of the invention
The present invention solves the degradation of speaker-recognition performance caused by a mismatch between the codings of the training and test speech by providing a compensation method for the influence of different speech codings in speaker recognition. The invention is realized by the following steps:
Step 1: adopt a certain coding scheme as the standard coding scheme; perform feature processing on the speech signals of N speakers under the standard coding and train with the expectation-maximization algorithm to obtain N speaker models {λ_n}, n = 1, …, N, under the standard coding as the matching object library, where N is a natural number;
Step 2: input the speech signal s(n) of the speaker to be recognized, and perform feature extraction on the input signal to obtain the feature vector sequence X = {x_1, x_2, …, x_S}, where S is a natural number;
Step 3: select the first T frames of the feature vector sequence X to obtain the sequence X_T = {x_1, x_2, …, x_T}, and from this T-frame sequence X_T adaptively obtain, with the MAP algorithm, the deviation h_MAP between the current coding and the standard coding, where T is a natural number;
Step 4: use the obtained deviation h_MAP between the current coding and the standard coding to adjust and compensate the feature vector sequence X, obtaining the new feature vector sequence X = {x_1 − h_MAP, x_2 − h_MAP, …, x_S − h_MAP};
Step 5: match the new feature vector sequence X against each of the N speaker models {λ_n}, n = 1, …, N, under the standard coding, and make a decision to obtain the recognition result.
Beneficial effects: the present invention adjusts the features under the coding used at recognition time so that they approach the speech features in the matching object library, and uses a Gaussian distribution to estimate the coding deviation, reducing the distortion of the speaker's speech features caused by coding and thus the loss of recognition rate caused by coding mismatch; the average recognition rate of the system under coding mismatch is improved by 7.1%.
Description of drawings
Fig. 1 shows the variation of the system recognition rate as the adjustment factor α takes values from 0 to 0.9. Fig. 2 shows the variation of the system recognition rate under coding compensation for the baseline system and for the maximum a posteriori probability algorithm, respectively, where one curve is the recognition rate obtained with the MAP algorithm and the other is the recognition rate obtained with the baseline system.
Embodiment
Embodiment 1: referring to Fig. 1 and Fig. 2, this embodiment consists of the following steps (a minimal end-to-end sketch is given after the list):
Step 1: adopt one of the coding schemes, such as MP3 coding, RM coding, or WMA coding, as the standard coding scheme; perform feature processing on the speech signals of N speakers under the standard coding and train with the expectation-maximization algorithm to obtain N speaker Gaussian mixture models {λ_n}, n = 1, …, N, under the standard coding as the matching object library, where N is a natural number;
Step 2: input the speech signal s(n) of the speaker to be recognized, and perform feature extraction on the input signal to obtain the feature vector sequence X = {x_1, x_2, …, x_S}, where S is a natural number;
Step 3: select the first T frames of the feature vector sequence X to obtain the sequence X_T = {x_1, x_2, …, x_T}, and from this T-frame sequence X_T adaptively obtain, with the MAP algorithm, the deviation h_MAP between the current coding and the standard coding, where T is a natural number;
Step 4: use the obtained deviation h_MAP between the current coding and the standard coding to adjust and compensate the feature vector sequence X, obtaining the new feature vector sequence X = {x_1 − h_MAP, x_2 − h_MAP, …, x_S − h_MAP};
Step 5: match the new feature vector sequence X against each of the N speaker models {λ_n}, n = 1, …, N, under the standard coding, and make a decision to obtain the recognition result.
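For concreteness, the following is a minimal, runnable sketch of this five-step flow on synthetic data. The use of scikit-learn's GaussianMixture as the speaker models, the synthetic features, the choice of reference model, and the use of the simple initialization h_0 of formula (10) in place of the full MAP iteration of formula (6) (both sketched further below) are all illustrative assumptions, not the patent's reference implementation.

```python
# Assumed sketch: sklearn GMMs as speaker models, synthetic LPCC-like
# features, and the h0 estimate of formula (10) standing in for the
# full MAP iteration of formula (6).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
L, M, T = 12, 8, 50                       # feature dim, mixture count, adaptation frames

# Step 1: one diagonal-covariance GMM per speaker, trained on standard-coded features.
gmms = []
for n in range(3):                        # N = 3 speakers in this sketch
    feats = rng.normal(loc=float(n), scale=1.0, size=(500, L))
    gmms.append(GaussianMixture(n_components=M, covariance_type="diag",
                                random_state=0).fit(feats))

# Step 2: test features of speaker 1, shifted by a simulated coding deviation h.
h_true = 0.4 * np.ones(L)
X = rng.normal(loc=1.0, scale=1.0, size=(200, L)) + h_true

# Step 3 (stand-in for the MAP estimate): h0 of formula (10), the
# mixture-weighted average difference between the first T frames and the means.
g = gmms[1]                               # assumed reference speaker model for adaptation
diff = X[:T, None, :] - g.means_[None, :, :]              # (T, M, L)
h0 = (g.weights_[None, :, None] * diff).sum(axis=1).mean(axis=0)

# Step 4: compensate every frame; Step 5: score against each model and decide.
X_comp = X - h0
scores = [gm.score(X_comp) for gm in gmms]  # mean log-likelihood per model
print("decision:", int(np.argmax(scores)))  # expected: speaker index 1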
The feature extraction in step 2 of this embodiment proceeds as follows: the speaker signal s(n) is sampled, quantized, and pre-emphasized; assuming the speaker signal is short-time stationary, it is divided into frames, concretely by weighting with a movable finite-length window; linear predictive coding (LPC) coefficients are computed for the weighted speech signal s_w(n), and the feature vector sequence X = {x_1, x_2, …, x_S} is then obtained from the relation between the LPC and the linear prediction cepstral coefficients (LPCC), which is as follows (a small conversion routine is sketched after the formula):
c_{LP}(1) = a_1
c_{LP}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, a_{n-k}\, c_{LP}(k), \quad 1 < n \le p
c_{LP}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n}\, a_{n-k}\, c_{LP}(k), \quad n > p
where c_{LP}(n) is the n-th dimensional component of the LPCC, a_n is the n-th dimensional component of the LPC, p is the LPC order, and n is a natural number.
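As a sketch, the recursion can be implemented directly; the function name and the handling of the text's 1-based indices are assumptions for illustration.

```python
import numpy as np

def lpc_to_lpcc(a, q):
    """Convert p LPC coefficients to q LPCC coefficients via the recursion above.

    a : sequence of LPC coefficients, with a[0] holding a_1 in the text's 1-based indexing.
    q : number of cepstral coefficients to produce.
    """
    p = len(a)
    c = np.zeros(q + 1)                        # c[n] stores c_LP(n); index 0 unused
    for n in range(1, q + 1):
        acc = a[n - 1] if n <= p else 0.0      # the a_n term exists only for n <= p
        for k in range(max(1, n - p), n):      # k range keeps a_{n-k} within 1..p
            acc += (k / n) * a[n - k - 1] * c[k]
        c[n] = acc
    return c[1:]

# e.g. lpc_to_lpcc(np.array([0.9, -0.3, 0.1]), 12) yields a 12-dimensional LPCC vector
```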
The computation in steps 3 and 4 is as follows: assume there is a coding deviation h between the coding of the test speech and the coding of the training speech; this deviation h can be represented by a single Gaussian distribution N(μ_h, Σ_h) with mean μ_h and covariance matrix Σ_h. According to the MAP estimation criterion, the MAP estimate h_MAP is:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, p(h \mid X, \lambda) \,\} \qquad (1)
where λ is the reference speaker model and X represents the chosen first-T-frame sequence X_T.
By Bayes' formula and the monotonicity of the logarithm, formula (1) is equivalent to:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \log p(X \mid h, \lambda) + \log p(h) \,\} \qquad (2)
where p(h) is the prior of the coding deviation h.
To control the proportion of the prior of the coding deviation h when the amount of adaptation data differs, an adjustment factor α is added to formula (2), giving:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \alpha \log p(X \mid h, \lambda) + (1-\alpha) \log p(h) \,\} \qquad (3)
where p(X | h, λ) takes the Gaussian mixture form, that is:
p(X \mid h, \lambda) = \sum_{i=1}^{M} p(X, i \mid h, \lambda) = \sum_{i=1}^{M} c_i\, p_i(X \mid h, \lambda) \qquad (4)
Here M is 64, i indexes the i-th mixture component, and c_i is the weight of each mixture component.
Formula (3) is solved on the T-frame adaptation data set with the expectation-maximization algorithm to estimate the current coding deviation. After a series of transformations of the auxiliary function Q over the hidden state sequence of the Gaussian mixture model, the resulting function is:
Q(h, \bar{h}) = \alpha \sum_{t=1}^{T} \sum_{i=1}^{M} \frac{p(x_t, i \mid h, \lambda)}{p(x_t \mid h, \lambda)} \log p(x_t, i \mid \bar{h}, \lambda) + (1-\alpha)\, T \log p(\bar{h}) \qquad (5)
where h is the previous iteration result and h̄ is the current iteration result; x_t is the speech feature of the t-th frame; p(x_t, i | h, λ) is the probability of the t-th frame, after adjustment by the deviation h, on the i-th mixture component of model λ; p(x_t | h, λ) is the probability of the t-th frame, after adjustment by the deviation h, on all mixture components of model λ; p(h̄) is the prior of the coding deviation evaluated at h̄.
Assume that the covariance matrix Σ_h of the coding deviation h is diagonal; then setting ∂Q/∂h̄_j = 0 yields
\bar{h}_j = \frac{\alpha \sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{x_{tj} - \mu_{ij}}{\sigma_{ij}^2} \right] + (1-\alpha)\, \frac{T \mu_{hj}}{\sigma_{hj}^2}}{\alpha \sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{1}{\sigma_{ij}^2} \right] + (1-\alpha)\, \frac{T}{\sigma_{hj}^2}} \qquad (6)
where h̄_j is the j-th dimensional value of the current iterate h̄, j = 1, 2, …, L, with L the dimension of the feature vector; x_{tj} is the j-th dimensional value of the t-th frame feature vector of the test speech; μ_{ij} and σ_{ij}² are the j-th mean and j-th variance of the i-th mixture component of the speaker model under the standard coding; μ_{hj} and σ_{hj}² are, respectively, the j-th dimensional value of the coding-deviation mean μ_h and the j-th value of the covariance matrix Σ_h.
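A minimal NumPy sketch of the fixed-point iteration of formula (6) follows, assuming the reference GMM is given as weight, mean, and diagonal-variance arrays; the function name, iteration count, and zero initialization are illustrative assumptions (the patent initializes with h_0 of formula (10) below).

```python
import numpy as np

def map_deviation(X_T, c, mu, var, mu_h, var_h, alpha=0.5, iters=10):
    """Iterate formula (6): MAP estimate of the coding deviation h.

    X_T   : (T, L) adaptation frames (the first T frames of the test speech).
    c     : (M,)   mixture weights c_i of the reference GMM lambda.
    mu    : (M, L) component means mu_i.
    var   : (M, L) diagonal component variances sigma_i^2.
    mu_h  : (L,)   prior mean of the deviation h.
    var_h : (L,)   prior (diagonal) variance of the deviation h.
    """
    T = X_T.shape[0]
    h = np.zeros(X_T.shape[1])                # illustrative initialization
    for _ in range(iters):
        Xc = X_T - h                          # frames adjusted by the current h
        # log of c_i N(x_t - h; mu_i, diag var_i) for every (t, i)
        log_ci = (np.log(c)[None, :]
                  - 0.5 * np.log(2 * np.pi * var).sum(axis=1)[None, :]
                  - 0.5 * (((Xc[:, None, :] - mu[None, :, :]) ** 2)
                           / var[None, :, :]).sum(axis=2))
        # component posteriors c_i p_i(x_t|h) / p(x_t|h), shape (T, M)
        gamma = np.exp(log_ci - np.logaddexp.reduce(log_ci, axis=1, keepdims=True))
        # numerator and denominator of formula (6), dimension by dimension
        resid = (X_T[:, None, :] - mu[None, :, :]) / var[None, :, :]   # (T, M, L)
        num = alpha * np.einsum('tm,tml->l', gamma, resid) \
              + (1 - alpha) * T * mu_h / var_h
        den = alpha * (gamma[:, :, None] / var[None, :, :]).sum(axis=(0, 1)) \
              + (1 - alpha) * T / var_h
        h = num / den
    return h
```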
In the above estimation formula for the coding deviation, the prior parameters μ_{hj} and σ_{hj}² of the coding deviation are unknown quantities, so before MAP estimation the prior of the coding deviation h must first be obtained.
To obtain the prior of the coding deviation h, set the factor α in formula (6) to 1; the maximum a posteriori estimate then reduces to the maximum likelihood estimate, with the corresponding iterative formula:
\bar{h}_j = \frac{\sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{x_{tj} - \mu_{ij}}{\sigma_{ij}^2} \right]}{\sum_{t=1}^{T} \left[ \sum_{i=1}^{M} \frac{c_i\, p_i(x_t \mid h, \lambda)}{p(x_t \mid h, \lambda)} \cdot \frac{1}{\sigma_{ij}^2} \right]} \qquad (7)
If there are H classes of coding, the estimates of the H class coding deviations can be obtained from formula (7), denoted {h̄_{M1}, h̄_{M2}, …, h̄_{MH}}; finally, the values of μ_h and Σ_h are estimated with formulas (8) and (9).
\mu_h = \frac{1}{H} \sum_{k=1}^{H} \bar{h}_{Mk} \qquad (8)
\Sigma_h = \frac{1}{H} \sum_{k=1}^{H} (\bar{h}_{Mk} - \mu_h)^2 \qquad (9)
Formula (7) raises the problem of setting the initial value of the coding deviation h; here the accumulated difference between the mean of the speech under the current non-standard coding and the reference speaker model under the standard coding is taken as the initial value h_0 of h, as shown in the following formula, where c_i is the weight of the i-th mixture component of the reference speaker model GMM:
h_0 = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{M} \left[ c_i \, (x_t - \mu_i) \right] \qquad (10)
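As a sketch, the initialization of formula (10) and the prior estimation of formulas (8) and (9) can be written as follows; the diagonal-variance convention for Σ_h and the function names are assumptions for illustration, and the per-codec ML estimates fed to deviation_prior would come from formula (7), e.g. the map_deviation sketch above with alpha = 1 (the prior terms then vanish).

```python
import numpy as np

def initial_deviation(X_T, c, mu):
    """Formula (10): h0 as the mixture-weighted mean difference between
    the T adaptation frames and the reference GMM component means."""
    diff = X_T[:, None, :] - mu[None, :, :]                 # (T, M, L)
    return (c[None, :, None] * diff).sum(axis=1).mean(axis=0)

def deviation_prior(h_ml_list):
    """Formulas (8) and (9): prior mean and (diagonal) variance of the
    coding deviation from H per-codec ML estimates of formula (7)."""
    H = np.stack(h_ml_list)                                 # (H, L)
    mu_h = H.mean(axis=0)
    var_h = ((H - mu_h) ** 2).mean(axis=0)
    return mu_h, var_h
```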
With the estimate of the deviation h in hand, the original feature space of the current coding can be compensated and mapped to the feature space of the standard coding; the concrete compensation strategy is:
X = X - h_{MAP} \qquad (11)
The matching and decision process in step 5 is as follows: for the feature vector sequence X, where X now denotes the new, compensated feature vector sequence of formula (11), the posterior probability of the n-th speaker is:
p(\lambda_n \mid X) = \frac{p(X \mid \lambda_n)\, p(\lambda_n)}{p(X)} = \frac{p(X \mid \lambda_n)\, p(\lambda_n)}{\sum_{m=1}^{N} p(X \mid \lambda_m)\, p(\lambda_m)} \qquad (12)
where p(λ_n) is the prior probability that the n-th person speaks; p(X) is the probability density of the feature vector sequence X under the condition of the N speakers in the matching object library; and p(X | λ_n) is the class-conditional probability that the n-th person produces the feature vector sequence X. The maximum a posteriori criterion for the recognition result is:
n^{*} = \arg\max_{1 \le n \le N} p(\lambda_n \mid X) \qquad (13)
where n* denotes the recognition decision. Assuming that each person's prior probability of speaking is equal, we obtain:
p(\lambda_n) = \frac{1}{N}, \quad n = 1, 2, \ldots, N \qquad (14)
Moreover, p(X) in formula (12) is the same for every speaker. Thus, formula (13) can be written as
n^{*} = \arg\max_{1 \le n \le N} p(X \mid \lambda_n) \qquad (15)
At this point, the maximum a posteriori criterion reduces to the maximum likelihood criterion.
To simplify the computation, the log-likelihood function is usually adopted, and the decision is:
n^{*} = \arg\max_{1 \le n \le N} \ln p(X \mid \lambda_n) \qquad (16)
Formula (16) is the closed-set test decision rule. Only the closed-set test is discussed here, avoiding the influence of open-set test thresholds on the recognition rate, highlighting the effect of coding mismatch, and reducing the complexity of the problem. A sketch of the decision rule follows.
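A minimal sketch of formulas (12)-(16), assuming each speaker model is a diagonal-covariance GMM given as (weights, means, variances) arrays; the representation and function names are illustrative.

```python
import numpy as np

def decide(X_comp, gmm_params):
    """Closed-set decision of formula (16): with equal priors p(lambda_n) = 1/N
    and a common p(X), pick the model with the highest log-likelihood.

    X_comp     : (S, L) compensated feature vector sequence.
    gmm_params : list of (c, mu, var) per speaker; c (M,), mu (M, L), var (M, L).
    """
    def loglik(X, c, mu, var):
        # ln p(X | lambda) = sum_t ln sum_i c_i N(x_t; mu_i, diag var_i)
        log_ci = (np.log(c)[None, :]
                  - 0.5 * np.log(2 * np.pi * var).sum(axis=1)[None, :]
                  - 0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2)
                           / var[None, :, :]).sum(axis=2))
        return np.logaddexp.reduce(log_ci, axis=1).sum()

    scores = [loglik(X_comp, *p) for p in gmm_params]
    return int(np.argmax(scores))                           # n* of formula (16)
```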

Claims (2)

1. A compensation method for the influence of different speech codings in speaker recognition, characterized in that it is realized by the following steps:
Step 1: adopt a certain coding scheme as the standard coding scheme; perform feature processing on the speech signals of N speakers under the standard coding and train with the expectation-maximization algorithm to obtain N speaker models {λ_n}, n = 1, …, N, under the standard coding as the matching object library, where N is a natural number;
Step 2: input the speech signal s(n) of the speaker to be recognized, and perform feature extraction on the input signal to obtain the feature vector sequence X = {x_1, x_2, …, x_S}, where S is a natural number;
Step 3: select the first T frames of the feature sequence X to obtain the sequence X_T = {x_1, x_2, …, x_T}, and from this T-frame sequence X_T adaptively obtain, with the MAP algorithm, the deviation h_MAP between the current coding and the standard coding, where T is a natural number;
Step 4: use the obtained deviation h_MAP between the current coding and the standard coding to adjust and compensate the feature sequence, obtaining the new feature vector sequence X = {x_1 − h_MAP, x_2 − h_MAP, …, x_S − h_MAP};
Step 5: match the new feature vector sequence X against each of the N speaker models {λ_n}, n = 1, …, N, under the standard coding, and make a decision to obtain the recognition result.
2. The compensation method for the influence of different speech codings in speaker recognition according to claim 1, characterized in that, in the MAP algorithm described in step 3, the MAP estimate h_MAP is:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, p(h \mid X, \lambda) \,\} \qquad (1)
where λ is the reference speaker model and X represents the chosen first-T-frame sequence X_T.
By Bayes' formula and the monotonicity of the logarithm, formula (1) is equivalent to:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \log p(X \mid h, \lambda) + \log p(h) \,\} \qquad (2)
where p(h) is the prior of the coding deviation h;
To control the proportion of the prior of the coding deviation h when the amount of adaptation data differs, an adjustment factor α is added to formula (2), giving:
\bar{h}_{MAP} = \arg\max_{h}\,\{\, \alpha \log p(X \mid h, \lambda) + (1-\alpha) \log p(h) \,\} \qquad (3)
where p(X | h, λ) takes the Gaussian mixture form, that is:
p(X \mid h, \lambda) = \sum_{i=1}^{M} p(X, i \mid h, \lambda) = \sum_{i=1}^{M} c_i\, p_i(X \mid h, \lambda) \qquad (4)
where i indexes the i-th mixture component and c_i is the weight of each mixture component.
CNA2008100646691A 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition Pending CN101315771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100646691A CN101315771A (en) 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100646691A CN101315771A (en) 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition

Publications (1)

Publication Number Publication Date
CN101315771A true CN101315771A (en) 2008-12-03

Family

ID=40106754

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100646691A Pending CN101315771A (en) 2008-06-04 2008-06-04 Compensation method for different speech coding influence in speaker recognition

Country Status (1)

Country Link
CN (1) CN101315771A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024455B (en) * 2009-09-10 2014-09-17 索尼株式会社 Speaker recognition system and method
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20081203