CN1221939C - Speaker self-adaptive method in speech recognition system - Google Patents
- Publication number
- CN1221939C (application CNB031022065A / CN03102206A)
- Authority
- CN
- China
- Prior art keywords
- self-adaptation
- class
- decision tree
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention provides a speaker adaptation method for a speech recognition system, called linear interpolation of covariance matrices under maximum Gaussian similarity, which overcomes the weakness of the Gaussian-similarity binary-decision-tree method when little adaptation data is available. The main steps are as follows. Before adaptation: first, as in the binary-decision-tree adaptation method based on Gaussian similarity analysis, build a binary decision tree over the covariance matrices of the speaker-independent model; then, according to the decision tree, compute the class-center matrices of the intermediate nodes under each speaker-dependent model. During adaptation: first, determine from the amount of data provided by the test speaker which intermediate nodes are to be adapted by interpolation; then compute the interpolation coefficients from the adaptation data corresponding to each interpolated intermediate node; finally, compute the adapted class-center matrices and update the covariance matrices to obtain the adapted model.
Description
Technical field
The present invention relates to a speaker adaptation method in the field of speech recognition technology, and in particular to a speaker adaptation method for covariance matrices.
Background art
After more than half a century of development, speech recognition technology has made remarkable progress and is gradually moving out of the laboratory into practical applications. In particular, the shift from speaker-dependent (SD) to speaker-independent (SI) recognition has greatly widened the application space of the technology. For the same speaker, however, the performance of an SI system is usually well below that of a sufficiently trained SD system. The reason is that the acoustic model of an SD system is trained on a single speaker's data and therefore reflects that speaker's characteristics well, whereas the training set of an SI system contains speech from as many different speakers as possible, so the corresponding acoustic model is a smoothed model over many speakers, and a drop in SI recognition performance is hard to avoid. To remedy this weakness of SI systems, speaker adaptation techniques have been studied. The goal of speaker adaptation is to use a new speaker's speech data to adjust the speech features or the acoustic model parameters so that they match the new speaker as closely as possible, bringing the system performance after adaptation as close as possible to that of an SD system.
Model adaptation is the technique most commonly used for speaker adaptation. As shown in Fig. 1, it adjusts the acoustic model parameters (means or covariances) of the SI system according to some transformation, based on the speech data provided by the new speaker; the resulting model is called the speaker-adapted (SA) model. Once the SA model is obtained, the system uses it to recognize the speaker's subsequent speech; such a system is called a speaker adaptation system (Fig. 2).
Since the 1980s, many speaker model adaptation methods have been proposed. They fall roughly into two classes, adaptation based on Bayesian estimation and adaptation based on transformations; the corresponding typical algorithms are maximum a posteriori (MAP) estimation and maximum likelihood linear regression (MLLR). With the spread of speech recognition applications, fast adaptation has received growing attention. Its basic idea is to combine the MAP and MLLR algorithms and to make full use of the correlation between recognition units so as to reduce the number of parameters to be estimated.
The binary-decision-tree adaptation method based on Gaussian similarity analysis belongs to the transformation-based class and adapts the covariance matrices. Its basic idea is that a group of similar covariance matrices keeps the same similarity relations before and after adaptation, so the matrices of the group share the same transformation equation during adaptation, and the group itself is determined dynamically by a binary decision tree. This method is an effective way to adapt covariance matrices with a small amount of data, so that the recognition performance of the adapted model can ultimately approach that of a speaker-dependent model. It has a shortcoming, however: during adaptation at least one class-center matrix (the root-node matrix) must be estimated, and with very little adaptation data it is difficult to estimate a matrix stably, which leads to negative adaptation, i.e. the system performance after adaptation falls below the baseline performance.
Summary of the invention
The object of the present invention is to propose a new fast covariance adaptation method that overcomes the shortcoming of the Gaussian-similarity binary-decision-tree method when adaptation data is scarce.
To achieve the above object, the present invention comprises:
a step, before adaptation, of training a speaker-independent hidden Markov model;
a step, before adaptation, of building a binary decision tree over the state covariance matrices of this speaker-independent hidden Markov model;
a step, before adaptation, of computing the class-center covariance matrix of each intermediate node of the binary decision tree and the transformation relations between it and the covariance matrices of its corresponding leaf nodes;
a step, before adaptation, of training several speaker-dependent hidden Markov models;
a step, before adaptation, of computing, according to this binary decision tree, the class-center matrices of the intermediate nodes under each speaker-dependent model;
a step, during adaptation, of determining the adaptation classes from the adaptation data provided by the test speaker;
a step of estimating a class-center matrix for each adaptation class from the adaptation data by maximum likelihood;
a step of computing the best interpolation coefficients for each adaptation class;
a step of computing, for each adaptation class, the adapted class-center covariance matrix from the maximum-likelihood estimate, the class-center matrices of the speaker-dependent models, and the corresponding interpolation coefficients;
a step of updating the covariance matrices of each adaptation class to obtain the speaker-adapted model.
In the step of computing the best interpolation coefficients for each adaptation class, the criterion is maximum Gaussian similarity, i.e. the class-center matrix of the intermediate node obtained by linear interpolation is made as similar as possible to the class-center matrix obtained in the preceding step.
In the present invention, the covariance matrices used for interpolation are not only the covariance matrices of the HMM states but also the class-center covariance matrices of the intermediate nodes of the binary tree. An intermediate node represents all of its leaf nodes, so once it has been interpolated, all of its leaf covariance matrices are adapted. A further advantage of interpolating on the class centers of intermediate nodes is that the number of intermediate nodes used for interpolation can be decided dynamically from the amount of adaptation data, preserving fast adaptation while improving the asymptotic behaviour of the algorithm.
Description of drawings
Fig. 1 is a flow diagram of the model adaptation method;
Fig. 2 is a flow diagram of a speech recognition system after model adaptation;
Fig. 3 is a flow diagram of the steps before adaptation in the embodiment of the invention;
Fig. 4 is a flow diagram of building the binary decision tree in the embodiment of the invention;
Fig. 5 is a flow diagram of splitting a node with the K-means method shown in Fig. 4;
Fig. 6 is a flow diagram of the steps during adaptation in the embodiment of the invention.
Embodiment
The present invention is further described below with reference to the drawings and a specific embodiment.
Figs. 3 to 6 illustrate a preferred embodiment of the present invention.
Before adaptation, as shown in Fig. 3, a speaker-independent hidden Markov model (hereinafter the SI model) is trained. Then, using formula (3) as the distance measure between covariance matrices (i.e. the Gaussian similarity), a binary decision tree over the hidden Markov model (HMM) state covariance matrices is built with a top-down K-means procedure, and the transformation relation A_{i,Φ} between each state and its class-center covariance matrix is computed. As shown in Fig. 4, the covariance matrices of all states of the model to be adapted are first placed in the root node, and the center matrix C_Φ of this node is computed according to formula (1), where N_Φ is the number of leaf nodes in the set Φ. The root node is then split into two child nodes with the K-means algorithm and the splitting is repeated: if the states in the current node can no longer be split, or their number falls below a preset threshold, the node becomes a leaf node; otherwise the splitting continues until all leaf nodes are obtained, each leaf node corresponding to one covariance matrix. Finally, the class-center matrices C_Φ and the transformation matrices A_{i,Φ} between each class center and the covariance matrices of its corresponding leaf nodes are computed according to formula (2).
The K-means method is as follows: given n points X_1, X_2, ..., X_n in the space and the number of classes K (K = 2 in the present invention), denote the classes C_1, C_2, ..., C_K. The n points are assigned to the K classes so that the similarity within each class is maximized and the similarity between classes is minimized. As shown in Fig. 5, the steps are:
1. Choose K initial class centers, denoted c_1, c_2, ..., c_K.
2. For each point, compute its distance d(X_i, c_j) to every class center according to formula (3) and find the nearest center c_l, i.e. d(X_i, c_l) ≤ d(X_i, c_j) for j ∈ {1, 2, ..., K}, j ≠ l, 1 ≤ l ≤ K; then X_i ∈ C_l, i.e. X_i is assigned to class l. In this way the assignment of every point is determined.
3. Compute the total distance measure D.
4. Using the assignment result, recompute each class center from the points of that class.
5. With the new class centers, recompute the assignment of every point and the updated total distance measure D_new.
6. Compare the two total distance measures; if the difference is small enough, stop the iteration and take the final assignment and class centers; otherwise continue iterating, repeating steps 2-5.
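As an illustration of the procedure above, the following Python/NumPy sketch builds the binary decision tree by top-down K = 2 splitting of the state covariance matrices. It is a minimal sketch under stated assumptions: the Gaussian-similarity distance of formula (3) and the class-center definition of formulas (1)-(2) are not reproduced in this text, so a symmetrized trace divergence and a simple matrix average are used as stand-ins, and the `Node` class, the initialisation of the two centers, and the splitting threshold are illustrative only.

```python
import numpy as np

def cov_distance(c1, c2):
    """Stand-in for the Gaussian-similarity distance of formula (3):
    a symmetrized trace divergence between two covariance matrices."""
    d = c1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(c2, c1)) +
                  np.trace(np.linalg.solve(c1, c2))) - d

def class_center(covs):
    """Class-center matrix of a node (assumed here to be the average of the
    member covariance matrices; the patent defines it in formulas (1)-(2))."""
    return sum(covs) / len(covs)

class Node:
    def __init__(self, state_ids):
        self.state_ids = state_ids   # indices of the HMM states in this node
        self.children = []           # two child nodes after a K = 2 split
        self.parent = None

def split_node(node, covs, min_states=2, max_iter=20):
    """One K = 2 means split of a node, following steps 1-6 above."""
    ids = node.state_ids
    if len(ids) < min_states:
        return None
    centers = [covs[ids[0]], covs[ids[-1]]]            # step 1: initial centers
    prev_total, assign = np.inf, []
    for _ in range(max_iter):
        # step 2: assign every matrix to its nearest class center
        assign = [int(np.argmin([cov_distance(covs[i], c) for c in centers]))
                  for i in ids]
        # steps 3/5: total distance measure under the current assignment
        total = sum(cov_distance(covs[i], centers[a]) for i, a in zip(ids, assign))
        if prev_total - total < 1e-6:                   # step 6: converged
            break
        prev_total = total
        # step 4: recompute each class center from its members
        for k in range(2):
            members = [covs[i] for i, a in zip(ids, assign) if a == k]
            if members:
                centers[k] = class_center(members)
    left = [i for i, a in zip(ids, assign) if a == 0]
    right = [i for i, a in zip(ids, assign) if a == 1]
    if not left or not right:
        return None
    return Node(left), Node(right)

def build_tree(state_covariances, min_states=2):
    """Top-down construction of the binary decision tree over the SI model's
    state covariance matrices (min_states stands in for the preset threshold)."""
    root = Node(list(range(len(state_covariances))))
    stack = [root]
    while stack:
        node = stack.pop()
        split = split_node(node, state_covariances, min_states)
        if split is None:
            continue                                    # node stays a leaf
        node.children = list(split)
        for child in node.children:
            child.parent = node
        stack.extend(node.children)
    return root
```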
In this way, from the parameters of the HMM to be adapted and under the Gaussian similarity measure, a binary decision tree is built that describes the structural relations of the HMM state observation probability distributions in feature space. The states contained in each node of the tree are those whose observation probability distributions are close to one another under the Gaussian similarity measure, i.e. their distributions have similar shapes in feature space. This binary tree is in fact a structural description of the state observation probability distributions in feature space.
Next, still before adaptation, several speaker-dependent hidden Markov models (hereinafter the SD models) are trained. Then, according to the above decision tree, the class-center matrices C^{(s)}_{Φ_j} (s = 1, ..., S; j = 1, ..., J) of the intermediate nodes under each SD model are computed, where S is the number of SD models and J is the total number of intermediate nodes, as shown in Fig. 3.
During adaptation, the adaptation classes are first determined from the adaptation data provided by the test speaker. The method is as follows: count the number of speech samples of each leaf node in the adaptation data; if the count is below a preset threshold, back off to the node's parent and count all speech samples of the parent node; if the count exceeds the threshold, stop, otherwise continue, until all leaf nodes have been traced back. At this point the state classes appropriate for this batch of adaptation data have been obtained. This way of selecting state classes dynamically from the adaptation data is called data-driven. The choice of the threshold is crucial for obtaining the best adaptation effect from limited adaptation data. Because the adaptation data is limited, if the threshold is too small a state class may not have enough data to estimate its class center, making the estimated covariance matrix unstable and degrading the adaptation effect. If the threshold is too large, too few state classes are determined, the structural description of the state observation probability distributions in feature space becomes too coarse, and a good adaptation effect is again hard to obtain. Experiments show that, with very limited adaptation data, a speech-sample threshold between 350 and 450 is appropriate. Of course, more adaptation data benefits the adaptation effect. In the extreme case the number of state classes equals the number of states, i.e. each state class contains a single state and every state has enough data for parameter estimation; this case is equivalent to training a speaker-dependent covariance model, and such a large amount of speech data may only occur in unsupervised incremental adaptation. This also shows that the limiting performance of the present invention tends toward that of a speaker-dependent model.
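The data-driven selection just described can be sketched as follows, reusing the `Node` objects from the previous sketch (each node records its parent and the HMM states it covers). The per-state frame counts, the default threshold of 400 samples (within the 350-450 range quoted above), and the handling of nested selections are assumptions for illustration.

```python
def adaptation_classes(leaves, frame_counts, threshold=400):
    """Data-driven choice of adaptation classes: starting from each leaf,
    back off to the parent node until the node covers at least `threshold`
    adaptation samples (frames)."""
    def count(node):
        return sum(frame_counts.get(s, 0) for s in node.state_ids)

    selected = []
    for leaf in leaves:
        node = leaf
        while node.parent is not None and count(node) < threshold:
            node = node.parent
        if node not in selected:
            selected.append(node)
    # Keep only the topmost of any nested selections so that each HMM state
    # ends up in exactly one adaptation class (a simplifying assumption).
    return [n for n in selected
            if not any(set(n.state_ids) < set(m.state_ids) for m in selected)]
```

Here `leaves` are the leaf nodes of the tree built above and `frame_counts` maps each state index to the number of adaptation frames aligned to it.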
Then, for each adaptation class, the class-center matrix is estimated from the adaptation data by maximum likelihood. Concretely: suppose the leaf nodes (i.e. HMM states) contained in a node of the binary decision tree are s_1, s_2, ..., s_n. For each state, the second-order statistic C(s_i) is accumulated from its corresponding adaptation data according to formula (4), where T(s_i) is the total number of adaptation frames for state s_i. The second-order statistics of the states are then transformed into the space of the corresponding intermediate node according to formula (5), and the maximum-likelihood estimate of the class-center matrix is obtained.
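A sketch of this per-class estimation is given below. Because the bodies of formulas (4) and (5) are given only in the patent figures, the second-order statistic and the mapping into the class space are written under an assumed congruence relation Σ_i = A_{i,Φ} C_Φ A_{i,Φ}ᵀ between a leaf covariance and its class center, and the frame-weighted average is likewise an assumed form of the maximum-likelihood combination.

```python
import numpy as np

def second_order_stat(frames, mean):
    """C(s_i): second-order statistic of the adaptation frames of one state
    (an assumed form of formula (4); `frames` is a (T, d) array)."""
    diffs = frames - mean
    return diffs.T @ diffs / len(frames)

def ml_class_center(stats, frame_counts, transforms):
    """Frame-weighted estimate of the class-center matrix of one adaptation
    class. Each state statistic is mapped into the class space through its
    stored transformation, assuming Sigma_i = A C_Phi A^T."""
    numerator, total_frames = None, 0
    for state, C_i in stats.items():
        A_inv = np.linalg.inv(transforms[state])
        mapped = A_inv @ C_i @ A_inv.T       # pull the statistic back into class space
        weighted = frame_counts[state] * mapped
        numerator = weighted if numerator is None else numerator + weighted
        total_frames += frame_counts[state]
    return numerator / total_frames
```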
Next, the best interpolation coefficients are computed for each adaptation class. The criterion is maximum Gaussian similarity, i.e. the class-center matrix of the intermediate node obtained by linear interpolation is made as similar as possible to the class-center matrix obtained in the previous step. Assuming a single adaptation class, the objective function of the algorithm is formula (6). The interpolation coefficients are found with the gradient projection method, which mainly requires the two derivatives shown in formulas (7) and (8); using formulas (7) and (8), the optimal combination coefficients are obtained.
With the interpolation coefficients thus obtained and the class centers C_Φ^{(s)} (s = 1, ..., S) of each SD model, the adapted class centers C_Φ^{(SA)} are computed according to formula (9), where:
j denotes an intermediate node;
N_J is the total number of interpolated intermediate nodes, i.e. the total number of adaptation classes, determined dynamically from the adaptation data;
Φ_j denotes the set of leaf nodes (i.e. states) corresponding to node j;
C_{Φ_j}^{(s)} denotes the class center of the j-th intermediate node of the s-th SD model, s = 1, 2, ..., S, where S is the total number of SD models;
α_j = {α_{s,j} | s = 1, 2, ..., S} denotes the linear interpolation coefficients of the j-th intermediate node.
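The coefficient search and the interpolation of formula (9) can be sketched as follows. Only the final linear combination of the SD class centers follows directly from the description above; the objective of formula (6), its analytic derivatives (7)-(8), and the constraint set on the coefficients are replaced here by an assumed divergence, a numerical gradient, and a projection onto the simplex (non-negative coefficients summing to one), none of which is stated explicitly in this text.

```python
import numpy as np

def cov_distance(c1, c2):
    # Same stand-in divergence as in the tree-building sketch above.
    d = c1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(c2, c1)) +
                  np.trace(np.linalg.solve(c1, c2))) - d

def project_simplex(v):
    """Project onto {alpha >= 0, sum(alpha) = 1} (an assumed constraint set)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def interpolation_coefficients(sd_centers, ml_center, steps=200, lr=0.01, eps=1e-5):
    """Projected-gradient search for the coefficients alpha_j that make the
    interpolated class center as close as possible to the ML estimate.
    A numerical gradient replaces the analytic derivatives of formulas (7)-(8)."""
    def objective(alpha):
        mix = sum(a * C for a, C in zip(alpha, sd_centers))
        return cov_distance(mix, ml_center)

    S = len(sd_centers)
    alpha = np.full(S, 1.0 / S)
    for _ in range(steps):
        grad = np.zeros(S)
        for s in range(S):
            e = np.zeros(S); e[s] = eps
            grad[s] = (objective(alpha + e) - objective(alpha - e)) / (2.0 * eps)
        alpha = project_simplex(alpha - lr * grad)
    return alpha

def adapted_class_center(alpha, sd_centers):
    """Formula (9) as described above: a linear interpolation of the
    SD-model class centers with the coefficients alpha."""
    return sum(a * C for a, C in zip(alpha, sd_centers))
```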
Using the adapted class centers C_Φ^{(SA)} obtained in the previous step, the covariance matrices are updated according to formula (10) to obtain the speaker-adapted model (SA model). The present invention adapts only the covariance matrices; the mean vectors of the model remain unchanged.
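Finally, a sketch of the covariance update, under the same assumed congruence relation used in the maximum-likelihood sketch above; the actual formula (10) is not reproduced in this text, so the update form is an assumption.

```python
def update_covariances(selected_nodes, adapted_centers, transforms, covariances):
    """Write the adapted covariances back into the model; the mean vectors
    are left untouched. Assumes the relation
    Sigma_i = A_{i,Phi} C_Phi^{(SA)} A_{i,Phi}^T used in the ML sketch."""
    for node, C_sa in zip(selected_nodes, adapted_centers):
        for state in node.state_ids:
            A = transforms[state]
            covariances[state] = A @ C_sa @ A.T
    return covariances
```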
The present invention applies a maximum-likelihood model-interpolation algorithm to fast adaptation of covariance matrices, and can therefore also be called linear interpolation of covariance matrices under maximum Gaussian similarity. It overcomes the defect of the Gaussian-similarity binary-decision-tree method under scarce adaptation data, where the difficulty of estimating a matrix stably leads to negative adaptation, and it therefore has considerable value for practical application.
Claims (1)
1. A speaker adaptation method in a speech recognition system, comprising:
a step, before adaptation, of training a speaker-independent hidden Markov model;
a step, before adaptation, of building a binary decision tree over the state covariance matrices of this speaker-independent hidden Markov model;
a step, before adaptation, of computing the class-center covariance matrix of each intermediate node of the binary decision tree and the transformation relations between it and the covariance matrices of its corresponding leaf nodes;
a step, during adaptation, of determining the adaptation classes from the adaptation data provided by the test speaker;
a step of estimating a class-center matrix for each adaptation class from the adaptation data by maximum likelihood;
a step of computing the adapted class-center covariance matrix for each adaptation class;
a step of updating the covariance matrices of each adaptation class to obtain the speaker-adapted model;
characterized in that the speaker adaptation method in the speech recognition system further comprises:
a step, before adaptation, of training several speaker-dependent hidden Markov models;
a step, before adaptation, of computing, according to said binary decision tree, the class-center matrices of the intermediate nodes under each speaker-dependent model;
a step, during adaptation, of computing the best interpolation coefficients for each adaptation class, the criterion being maximum Gaussian similarity, i.e. the class-center matrix of the intermediate node obtained by linear interpolation is made as similar as possible to the class-center matrix obtained in the preceding step;
and in that said step of computing the adapted class-center covariance matrix for each adaptation class is performed with the maximum-likelihood estimate, the class-center matrices of the speaker-dependent models, and the corresponding interpolation coefficients.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB031022065A (granted as CN1221939C) | 2003-01-27 | 2003-01-27 | Speaker self-adaptive method in speech recognition system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1521728A | 2004-08-18 |
| CN1221939C | 2005-10-05 |
Family
ID=34281633
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101123648B | 2006-08-11 | 2010-05-12 | 中国科学院声学研究所 | Self-adapted method in phone voice recognition |
Legal Events
| Code | Title |
|---|---|
| C06, PB01 | Publication |
| C10, SE01 | Entry into substantive examination |
| C14, GR01 | Grant of patent or utility model |
| C19, CF01 | Termination of patent right due to non-payment of the annual fee |