CN1221939C - Speaker self-adaptive method in speech recognition system - Google Patents
- Publication number
- CN1221939C (application CNB031022065A / CN03102206A)
- Authority
- CN
- China
- Prior art keywords
- self-adaptation
- class
- decision tree
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention provides a speaker adaptation method for a speech recognition system, called linear interpolation of covariance matrices under maximum Gaussian similarity, which overcomes the weakness of the Gaussian-similarity binary-decision-tree method when little adaptation data is available. The main steps are as follows. Before adaptation: first, as in the binary-decision-tree adaptation method based on Gaussian similarity analysis, build a binary decision tree over the covariance matrices of the speaker-independent model; then, according to the decision tree, compute the class-center matrices of the intermediate nodes under each speaker-dependent model. During adaptation: first, determine from the amount of data provided by the test speaker which intermediate nodes are to be adapted by interpolation; then compute the interpolation coefficients from the adaptation data corresponding to each interpolated intermediate node; finally, compute the adapted class-center matrices and update the covariance matrices to obtain the adapted model.
Description
Technical field
The present invention relates to a speaker adaptation method in the field of speech recognition technology, and in particular to a speaker adaptation method for covariance matrices.
Background art
After more than half a century of development, speech recognition technology has made remarkable progress and is gradually moving out of the laboratory into practical applications. In particular, the shift from speaker-dependent (SD) to speaker-independent (SI) recognition has greatly widened the application space of the technology. For the same speaker, however, the performance of an SI system is usually well below that of a sufficiently trained SD system. The reason is that the acoustic model of an SD system is trained on a single speaker's data and therefore reflects that speaker's characteristics well, whereas the training set of an SI system contains speech from as many different speakers as possible, so the corresponding acoustic model is a smoothed model over many speakers, and a drop in SI recognition performance is hard to avoid. To remedy this weakness of SI systems, speaker adaptation techniques have been studied. The goal of speaker adaptation is to use a new speaker's speech data to adjust the speech features or the acoustic model parameters so that they match the new speaker as closely as possible, bringing the system performance after adaptation as close as possible to that of an SD system.
Model adaptation is the technique most commonly used for speaker adaptation. As shown in Fig. 1, it adjusts the acoustic model parameters (means or covariances) of the SI system according to some transformation, based on the speech data provided by the new speaker; the resulting model is called the speaker-adapted (SA) model. Once the SA model is obtained, the system uses it to recognize the speaker's subsequent speech; such a system is called a speaker adaptation system (Fig. 2).
Since the 1980s, many speaker model adaptation methods have been proposed. They fall roughly into two classes, adaptation based on Bayesian estimation and adaptation based on transformations; the corresponding typical algorithms are maximum a posteriori (MAP) estimation and maximum likelihood linear regression (MLLR). With the spread of speech recognition applications, fast adaptation has received growing attention. Its basic idea is to combine the MAP and MLLR algorithms and to make full use of the correlation between recognition units so as to reduce the number of parameters to be estimated.
The binary-decision-tree adaptation method based on Gaussian similarity analysis belongs to the transformation-based class and adapts the covariance matrices. Its basic idea is that a group of similar covariance matrices keeps the same similarity relations before and after adaptation, so the matrices of the group share the same transformation equation during adaptation, and the group itself is determined dynamically by a binary decision tree. This method is an effective way to adapt covariance matrices with a small amount of data, so that the recognition performance of the adapted model can ultimately approach that of a speaker-dependent model. It has a shortcoming, however: during adaptation at least one class-center matrix (the root-node matrix) must be estimated, and with very little adaptation data it is difficult to estimate a matrix stably, which leads to negative adaptation, i.e. the system performance after adaptation falls below the baseline performance.
Summary of the invention
The object of the present invention is to propose a new fast covariance adaptation method that overcomes the shortcoming of the Gaussian-similarity binary-decision-tree method when adaptation data is scarce.
To achieve the above object, the present invention comprises:
a step, before adaptation, of training a speaker-independent hidden Markov model;
a step, before adaptation, of building a binary decision tree over the state covariance matrices of this speaker-independent hidden Markov model;
a step, before adaptation, of computing the class-center covariance matrix of each intermediate node of the binary decision tree and the transformation relations between it and the covariance matrices of its corresponding leaf nodes;
a step, before adaptation, of training several speaker-dependent hidden Markov models;
a step, before adaptation, of computing, according to this binary decision tree, the class-center matrices of the intermediate nodes under each speaker-dependent model;
a step, during adaptation, of determining the adaptation classes from the adaptation data provided by the test speaker;
a step of estimating a class-center matrix for each adaptation class from the adaptation data by maximum likelihood;
a step of computing the best interpolation coefficients for each adaptation class;
a step of computing, for each adaptation class, the adapted class-center covariance matrix from the maximum-likelihood estimate, the class-center matrices of the speaker-dependent models, and the corresponding interpolation coefficients;
a step of updating the covariance matrices of each adaptation class to obtain the speaker-adapted model.
In the step of computing the best interpolation coefficients for each adaptation class, the criterion is maximum Gaussian similarity, i.e. the class-center matrix of the intermediate node obtained by linear interpolation is made as similar as possible to the class-center matrix obtained in the preceding step.
In the present invention, the covariance matrices used for interpolation are not only the covariance matrices of the HMM states but also the class-center covariance matrices of the intermediate nodes of the binary tree. An intermediate node represents all of its leaf nodes, so once it has been interpolated, all of its leaf covariance matrices are adapted. A further advantage of interpolating on the class centers of intermediate nodes is that the number of intermediate nodes used for interpolation can be decided dynamically from the amount of adaptation data, preserving fast adaptation while improving the asymptotic behaviour of the algorithm.
Description of drawings
Fig. 1 is a flow diagram of the model adaptation method;
Fig. 2 is a flow diagram of a speech recognition system after model adaptation;
Fig. 3 is a flow diagram of the steps before adaptation in the embodiment of the invention;
Fig. 4 is a flow diagram of building the binary decision tree in the embodiment of the invention;
Fig. 5 is a flow diagram of splitting a node with the K-means method shown in Fig. 4;
Fig. 6 is a flow diagram of the steps during adaptation in the embodiment of the invention.
Embodiment
The present invention is further described below with reference to the drawings and a specific embodiment.
Figs. 3 to 6 illustrate a preferred embodiment of the present invention.
Before adaptation, as shown in Fig. 3, a speaker-independent hidden Markov model (hereinafter the SI model) is trained. Then, using formula (3) as the distance measure between covariance matrices (i.e. the Gaussian similarity), a binary decision tree over the hidden Markov model (HMM) state covariance matrices is built with a top-down K-means procedure, and the transformation relation A_{i,Φ} between each state and its class-center covariance matrix is computed. As shown in Fig. 4, the covariance matrices of all states of the model to be adapted are first placed in the root node, and the center matrix C_Φ of this node is computed according to formula (1), where N_Φ is the number of leaf nodes in the set Φ. The root node is then split into two child nodes with the K-means algorithm and the splitting is repeated: if the states in the current node can no longer be split, or their number falls below a preset threshold, the node becomes a leaf node; otherwise the splitting continues until all leaf nodes are obtained, each leaf node corresponding to one covariance matrix. Finally, the class-center matrices C_Φ and the transformation matrices A_{i,Φ} between each class center and the covariance matrices of its corresponding leaf nodes are computed according to formula (2).
The K-means method is as follows: given n points X_1, X_2, ..., X_n in the space and the number of classes K (K = 2 in the present invention), denote the classes C_1, C_2, ..., C_K. The n points are assigned to the K classes so that the similarity within each class is maximized and the similarity between classes is minimized. As shown in Fig. 5, the steps are:
1. Choose K initial class centers, denoted c_1, c_2, ..., c_K.
2. For each point, compute its distance d(X_i, c_j) to every class center according to formula (3) and find the nearest center c_l, i.e. d(X_i, c_l) ≤ d(X_i, c_j) for j ∈ {1, 2, ..., K}, j ≠ l, 1 ≤ l ≤ K; then X_i ∈ C_l, i.e. X_i is assigned to class l. In this way the assignment of every point is determined.
3. Compute the total distance measure D.
4. Using the assignment result, recompute each class center from the points of that class.
5. With the new class centers, recompute the assignment of every point and the updated total distance measure D_new.
6. Compare the two total distance measures; if the difference is small enough, stop the iteration and take the final assignment and class centers; otherwise continue iterating, repeating steps 2-5.
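As an illustration of the procedure above, the following Python/NumPy sketch builds the binary decision tree by top-down K = 2 splitting of the state covariance matrices. It is a minimal sketch under stated assumptions: the Gaussian-similarity distance of formula (3) and the class-center definition of formulas (1)-(2) are not reproduced in this text, so a symmetrized trace divergence and a simple matrix average are used as stand-ins, and the `Node` class, the initialisation of the two centers, and the splitting threshold are illustrative only.

```python
import numpy as np

def cov_distance(c1, c2):
    """Stand-in for the Gaussian-similarity distance of formula (3):
    a symmetrized trace divergence between two covariance matrices."""
    d = c1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(c2, c1)) +
                  np.trace(np.linalg.solve(c1, c2))) - d

def class_center(covs):
    """Class-center matrix of a node (assumed here to be the average of the
    member covariance matrices; the patent defines it in formulas (1)-(2))."""
    return sum(covs) / len(covs)

class Node:
    def __init__(self, state_ids):
        self.state_ids = state_ids   # indices of the HMM states in this node
        self.children = []           # two child nodes after a K = 2 split
        self.parent = None

def split_node(node, covs, min_states=2, max_iter=20):
    """One K = 2 means split of a node, following steps 1-6 above."""
    ids = node.state_ids
    if len(ids) < min_states:
        return None
    centers = [covs[ids[0]], covs[ids[-1]]]            # step 1: initial centers
    prev_total, assign = np.inf, []
    for _ in range(max_iter):
        # step 2: assign every matrix to its nearest class center
        assign = [int(np.argmin([cov_distance(covs[i], c) for c in centers]))
                  for i in ids]
        # steps 3/5: total distance measure under the current assignment
        total = sum(cov_distance(covs[i], centers[a]) for i, a in zip(ids, assign))
        if prev_total - total < 1e-6:                   # step 6: converged
            break
        prev_total = total
        # step 4: recompute each class center from its members
        for k in range(2):
            members = [covs[i] for i, a in zip(ids, assign) if a == k]
            if members:
                centers[k] = class_center(members)
    left = [i for i, a in zip(ids, assign) if a == 0]
    right = [i for i, a in zip(ids, assign) if a == 1]
    if not left or not right:
        return None
    return Node(left), Node(right)

def build_tree(state_covariances, min_states=2):
    """Top-down construction of the binary decision tree over the SI model's
    state covariance matrices (min_states stands in for the preset threshold)."""
    root = Node(list(range(len(state_covariances))))
    stack = [root]
    while stack:
        node = stack.pop()
        split = split_node(node, state_covariances, min_states)
        if split is None:
            continue                                    # node stays a leaf
        node.children = list(split)
        for child in node.children:
            child.parent = node
        stack.extend(node.children)
    return root
```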
In this way, from the parameters of the HMM to be adapted and under the Gaussian similarity measure, a binary decision tree is built that describes the structural relations of the HMM state observation probability distributions in feature space. The states contained in each node of the tree are those whose observation probability distributions are close to one another under the Gaussian similarity measure, i.e. their distributions have similar shapes in feature space. This binary tree is in fact a structural description of the state observation probability distributions in feature space.
Next, still before adaptation, several speaker-dependent hidden Markov models (hereinafter the SD models) are trained. Then, according to the above decision tree, the class-center matrices C^{(s)}_{Φ_j} (s = 1, ..., S; j = 1, ..., J) of the intermediate nodes under each SD model are computed, where S is the number of SD models and J is the total number of intermediate nodes, as shown in Fig. 3.
During adaptation, the adaptation classes are first determined from the adaptation data provided by the test speaker. The method is as follows: count the number of speech samples of each leaf node in the adaptation data; if the count is below a preset threshold, back off to the node's parent and count all speech samples of the parent node; if the count exceeds the threshold, stop, otherwise continue, until all leaf nodes have been traced back. At this point the state classes appropriate for this batch of adaptation data have been obtained. This way of selecting state classes dynamically from the adaptation data is called data-driven. The choice of the threshold is crucial for obtaining the best adaptation effect from limited adaptation data. Because the adaptation data is limited, if the threshold is too small a state class may not have enough data to estimate its class center, making the estimated covariance matrix unstable and degrading the adaptation effect. If the threshold is too large, too few state classes are determined, the structural description of the state observation probability distributions in feature space becomes too coarse, and a good adaptation effect is again hard to obtain. Experiments show that, with very limited adaptation data, a speech-sample threshold between 350 and 450 is appropriate. Of course, more adaptation data benefits the adaptation effect. In the extreme case the number of state classes equals the number of states, i.e. each state class contains a single state and every state has enough data for parameter estimation; this case is equivalent to training a speaker-dependent covariance model, and such a large amount of speech data may only occur in unsupervised incremental adaptation. This also shows that the limiting performance of the present invention tends toward that of a speaker-dependent model.
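The data-driven selection just described can be sketched as follows, reusing the `Node` objects from the previous sketch (each node records its parent and the HMM states it covers). The per-state frame counts, the default threshold of 400 samples (within the 350-450 range quoted above), and the handling of nested selections are assumptions for illustration.

```python
def adaptation_classes(leaves, frame_counts, threshold=400):
    """Data-driven choice of adaptation classes: starting from each leaf,
    back off to the parent node until the node covers at least `threshold`
    adaptation samples (frames)."""
    def count(node):
        return sum(frame_counts.get(s, 0) for s in node.state_ids)

    selected = []
    for leaf in leaves:
        node = leaf
        while node.parent is not None and count(node) < threshold:
            node = node.parent
        if node not in selected:
            selected.append(node)
    # Keep only the topmost of any nested selections so that each HMM state
    # ends up in exactly one adaptation class (a simplifying assumption).
    return [n for n in selected
            if not any(set(n.state_ids) < set(m.state_ids) for m in selected)]
```

Here `leaves` are the leaf nodes of the tree built above and `frame_counts` maps each state index to the number of adaptation frames aligned to it.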
Then, for each adaptation class, the class-center matrix is estimated from the adaptation data by maximum likelihood. Concretely: suppose the leaf nodes (i.e. HMM states) contained in a node of the binary decision tree are s_1, s_2, ..., s_n. For each state, the second-order statistic C(s_i) is accumulated from its corresponding adaptation data according to formula (4), where T(s_i) is the total number of adaptation frames for state s_i. The second-order statistics of the states are then transformed into the space of the corresponding intermediate node according to formula (5), and the maximum-likelihood estimate of the class-center matrix is obtained.
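A sketch of this per-class estimation is given below. Because the bodies of formulas (4) and (5) are given only in the patent figures, the second-order statistic and the mapping into the class space are written under an assumed congruence relation Σ_i = A_{i,Φ} C_Φ A_{i,Φ}ᵀ between a leaf covariance and its class center, and the frame-weighted average is likewise an assumed form of the maximum-likelihood combination.

```python
import numpy as np

def second_order_stat(frames, mean):
    """C(s_i): second-order statistic of the adaptation frames of one state
    (an assumed form of formula (4); `frames` is a (T, d) array)."""
    diffs = frames - mean
    return diffs.T @ diffs / len(frames)

def ml_class_center(stats, frame_counts, transforms):
    """Frame-weighted estimate of the class-center matrix of one adaptation
    class. Each state statistic is mapped into the class space through its
    stored transformation, assuming Sigma_i = A C_Phi A^T."""
    numerator, total_frames = None, 0
    for state, C_i in stats.items():
        A_inv = np.linalg.inv(transforms[state])
        mapped = A_inv @ C_i @ A_inv.T       # pull the statistic back into class space
        weighted = frame_counts[state] * mapped
        numerator = weighted if numerator is None else numerator + weighted
        total_frames += frame_counts[state]
    return numerator / total_frames
```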
Next, the best interpolation coefficients are computed for each adaptation class. The criterion is maximum Gaussian similarity, i.e. the class-center matrix of the intermediate node obtained by linear interpolation is made as similar as possible to the class-center matrix obtained in the previous step. Assuming a single adaptation class, the objective function of the algorithm is formula (6). The interpolation coefficients are found with the gradient projection method, which mainly requires the two derivatives shown in formulas (7) and (8); using formulas (7) and (8), the optimal combination coefficients are obtained.
With the interpolation coefficients thus obtained and the class centers C_Φ^{(s)} (s = 1, ..., S) of each SD model, the adapted class centers C_Φ^{(SA)} are computed according to formula (9), where:
j denotes an intermediate node;
N_J is the total number of interpolated intermediate nodes, i.e. the total number of adaptation classes, determined dynamically from the adaptation data;
Φ_j denotes the set of leaf nodes (i.e. states) corresponding to node j;
C_{Φ_j}^{(s)} denotes the class center of the j-th intermediate node of the s-th SD model, s = 1, 2, ..., S, where S is the total number of SD models;
α_j = {α_{s,j} | s = 1, 2, ..., S} denotes the linear interpolation coefficients of the j-th intermediate node.
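The coefficient search and the interpolation of formula (9) can be sketched as follows. Only the final linear combination of the SD class centers follows directly from the description above; the objective of formula (6), its analytic derivatives (7)-(8), and the constraint set on the coefficients are replaced here by an assumed divergence, a numerical gradient, and a projection onto the simplex (non-negative coefficients summing to one), none of which is stated explicitly in this text.

```python
import numpy as np

def cov_distance(c1, c2):
    # Same stand-in divergence as in the tree-building sketch above.
    d = c1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(c2, c1)) +
                  np.trace(np.linalg.solve(c1, c2))) - d

def project_simplex(v):
    """Project onto {alpha >= 0, sum(alpha) = 1} (an assumed constraint set)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def interpolation_coefficients(sd_centers, ml_center, steps=200, lr=0.01, eps=1e-5):
    """Projected-gradient search for the coefficients alpha_j that make the
    interpolated class center as close as possible to the ML estimate.
    A numerical gradient replaces the analytic derivatives of formulas (7)-(8)."""
    def objective(alpha):
        mix = sum(a * C for a, C in zip(alpha, sd_centers))
        return cov_distance(mix, ml_center)

    S = len(sd_centers)
    alpha = np.full(S, 1.0 / S)
    for _ in range(steps):
        grad = np.zeros(S)
        for s in range(S):
            e = np.zeros(S); e[s] = eps
            grad[s] = (objective(alpha + e) - objective(alpha - e)) / (2.0 * eps)
        alpha = project_simplex(alpha - lr * grad)
    return alpha

def adapted_class_center(alpha, sd_centers):
    """Formula (9) as described above: a linear interpolation of the
    SD-model class centers with the coefficients alpha."""
    return sum(a * C for a, C in zip(alpha, sd_centers))
```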
Using the adapted class centers C_Φ^{(SA)} obtained in the previous step, the covariance matrices are updated according to formula (10) to obtain the speaker-adapted model (SA model). The present invention adapts only the covariance matrices; the mean vectors of the model remain unchanged.
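Finally, a sketch of the covariance update, under the same assumed congruence relation used in the maximum-likelihood sketch above; the actual formula (10) is not reproduced in this text, so the update form is an assumption.

```python
def update_covariances(selected_nodes, adapted_centers, transforms, covariances):
    """Write the adapted covariances back into the model; the mean vectors
    are left untouched. Assumes the relation
    Sigma_i = A_{i,Phi} C_Phi^{(SA)} A_{i,Phi}^T used in the ML sketch."""
    for node, C_sa in zip(selected_nodes, adapted_centers):
        for state in node.state_ids:
            A = transforms[state]
            covariances[state] = A @ C_sa @ A.T
    return covariances
```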
The present invention applies a maximum-likelihood model-interpolation algorithm to fast adaptation of covariance matrices, and can therefore also be called linear interpolation of covariance matrices under maximum Gaussian similarity. It overcomes the defect of the Gaussian-similarity binary-decision-tree method under scarce adaptation data, where the difficulty of estimating a matrix stably leads to negative adaptation, and it therefore has considerable value for practical application.
Claims (1)
1. A speaker adaptation method in a speech recognition system, comprising:
a step, before adaptation, of training a speaker-independent hidden Markov model;
a step, before adaptation, of building a binary decision tree over the state covariance matrices of this speaker-independent hidden Markov model;
a step, before adaptation, of computing the class-center covariance matrix of each intermediate node of the binary decision tree and the transformation relations between it and the covariance matrices of its corresponding leaf nodes;
a step, during adaptation, of determining the adaptation classes from the adaptation data provided by the test speaker;
a step of estimating a class-center matrix for each adaptation class from the adaptation data by maximum likelihood;
a step of computing the adapted class-center covariance matrix for each adaptation class;
a step of updating the covariance matrices of each adaptation class to obtain the speaker-adapted model;
characterized in that the speaker adaptation method in the speech recognition system further comprises:
a step, before adaptation, of training several speaker-dependent hidden Markov models;
a step, before adaptation, of computing, according to said binary decision tree, the class-center matrices of the intermediate nodes under each speaker-dependent model;
a step, during adaptation, of computing the best interpolation coefficients for each adaptation class, the criterion being maximum Gaussian similarity, i.e. the class-center matrix of the intermediate node obtained by linear interpolation is made as similar as possible to the class-center matrix obtained in the preceding step;
and in that said step of computing the adapted class-center covariance matrix for each adaptation class is performed with the maximum-likelihood estimate, the class-center matrices of the speaker-dependent models, and the corresponding interpolation coefficients.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB031022065A (granted as CN1221939C) | 2003-01-27 | 2003-01-27 | Speaker self-adaptive method in speech recognition system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1521728A | 2004-08-18 |
| CN1221939C | 2005-10-05 |
Family
ID=34281633
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101123648B | 2006-08-11 | 2010-05-12 | 中国科学院声学研究所 | Self-adapted method in phone voice recognition |
Legal Events
| Code | Title |
|---|---|
| C06, PB01 | Publication |
| C10, SE01 | Entry into substantive examination |
| C14, GR01 | Grant of patent or utility model |
| C19, CF01 | Termination of patent right due to non-payment of the annual fee |