CN1295676C - State structure regulating method in sound identification


Info

Publication number
CN1295676C
CN1295676C, CNB2004100667929A, CN200410066792A
Authority
CN
China
Prior art keywords
state
voice
model
self
adaptation
Prior art date
Legal status
Expired - Fee Related
Application number
CNB2004100667929A
Other languages
Chinese (zh)
Other versions
CN1588536A (en)
Inventor
朱杰 (Zhu Jie)
徐向华 (Xu Xianghua)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CNB2004100667929A priority Critical patent/CN1295676C/en
Publication of CN1588536A publication Critical patent/CN1588536A/en
Application granted granted Critical
Publication of CN1295676C publication Critical patent/CN1295676C/en

Abstract

The present invention relates to a state-structure adjustment method for voice recognition, in the field of voice recognition. In building a large-vocabulary continuous speech recognition system, the speech features use 12 cepstral coefficients plus short-time energy as the 13-dimensional basic features, and first-order and second-order differences are appended to give 39 feature dimensions. In the state-structure adjustment step, adaptation speech and training speech are used to adjust the state structure of the model, under the assumption that errors the baseline system makes when recognizing the training speech will also occur when it recognizes the test speech, so that the training corpus can be used to adjust the structure of the remaining states. In the speaker-adaptation step, the adaptation corpus is used to adapt the adjusted model with the maximum likelihood linear regression algorithm. The present invention raises the posterior probability of the model on the samples and improves the utilization of the adaptation corpus, thereby alleviating the loss of recognition rate caused by the structural mismatch between the training-corpus decision tree and the test-corpus decision tree.

Description

A state-structure adjustment method in speech recognition
Technical field
The present invention relates to a state-structure adjustment algorithm in the field of speech recognition, and specifically to a state-structure adjustment method in speech recognition.
Background technology
Since the 1990s, speaker-independent (SI) large-vocabulary continuous speech recognition (LVCSR) based on continuous-density HMMs has made great progress. To build more accurate models, LVCSR systems generally adopt context-dependent triphone models and further improve model performance with state-sharing strategies based on acoustic decision trees. At the same time, in an SI system the differences between speakers degrade system performance, which makes speaker-adaptation techniques the key to making SI systems practical. Commonly used adaptation methods include the Bayesian maximum a posteriori (MAP) method and maximum likelihood linear regression (MLLR); both transform the model parameters based on the adaptation corpus and do not consider adapting the structure of the decision tree. The merging and splitting of states in the decision tree are driven by the change in likelihood and the amount of data in the training corpus, so the resulting tree structure cannot effectively reflect the characteristics of the test data; especially when the training corpus and the adaptation corpus differ substantially, this structural deviation directly degrades system performance.
To reduce the loss of recognition rate caused by the structural mismatch between the training-corpus decision tree and the test-corpus decision tree, the structure of the training-corpus decision tree must be adjusted; however, directly adjusting that structure would in turn make the decision tree inconsistent with the training corpus and degrade model accuracy.
A literature search shows that A. Nakamura, in "Restructuring Gaussian mixture density functions in speaker-independent acoustic models" (Proc. ICASSP, vol. 1, pp. 649-652, 1998), proposed a method of adjusting the Gaussian mixture density functions. In that scheme, for a given utterance X and observation vector o_t at time t, the actual Gaussian function is f_t^a(\mu, \delta^2), belonging to state s_a, and the Gaussian function obtained by the Viterbi recognition algorithm is f_t^b(\mu, \delta^2), belonging to state s_b; s_a and s_b then share the Gaussian function f_t^b(\mu, \delta^2), thereby adjusting the Gaussian mixture distribution of s_a. An adjusted state contains a varying number of Gaussian functions, and a given Gaussian function may be shared by several states. However, the training process of that method is somewhat arbitrary, and because it relies only on the training corpus it cannot, to some extent, reflect information about the test speech.
Summary of the invention
In view of the above shortcomings and defects of the prior art, the present invention provides a state-structure adjustment method for speech recognition that raises the posterior probability of the model on the samples, improves the utilization of the adaptation corpus, increases the number of parameters within each state, and enlarges the descriptive power of the model while keeping the growth of the total number of system parameters limited, thereby reducing the loss of recognition rate caused by the structural mismatch between the training-corpus and test-corpus decision trees.
The present invention is achieved through the following technical solution: according to the degree of confusion between states, the state structure is adjusted by weighted sharing of Gaussians between confusable states. The concrete steps are as follows:
(1) Build a large-vocabulary continuous speech recognition system: the speech features use 12th-order Mel cepstral coefficients plus short-time energy, 13 dimensions in total, as the basic features; their first-order and second-order differences are appended, giving a final feature dimension of 39 (see the feature-construction sketch after these steps). The procedure follows general speech recognition practice. The features of every training utterance are extracted, and the HTK (HMM ToolKit) tools are used to first select initials and tonal finals as the basic modeling units according to the sentence content and build tonal monophone models; the models are then expanded from monophones to context-dependent triphone models, which take into account the different left and right initials/finals across syllables, with different contexts corresponding to different triphone models; finally, an acoustic decision tree is used to cluster the states of all triphone models derived from the same monophone, and the clustered states are gradually expanded from single Gaussian distributions to multiple-mixture Gaussian distributions.
(2) State-structure adjustment: this comprises adjusting the model state structure with the adaptation speech and adjusting it with the training speech. The adaptation speech and the test speech come from the same speakers, so errors that the baseline system makes when recognizing the adaptation speech will likewise occur when it recognizes the test speech. Therefore, appropriately adjusting the state structure according to the errors made when the baseline system recognizes the adaptation speech not only improves the utilization of the adaptation corpus but also raises the posterior probability of the model. On the other hand, using only the adaptation corpus limits the range of states that can be adjusted, whereas the training corpus comes from many speakers and its pronunciations are representative to some degree. It is therefore assumed that errors the baseline system makes when recognizing the training speech will also occur when it recognizes the test speech, so the training corpus can be used to adjust the structure of states that do not appear in the adaptation speech.
(3) Speaker adaptation: the maximum likelihood linear regression (MLLR) algorithm is applied, using the adaptation corpus to adapt the adjusted model, in order to further compensate for the remaining mismatch between the adjusted model and the test speech.
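As an illustration of the 39-dimensional feature construction in step (1), the following is a minimal sketch in Python/NumPy: given a (T, 13) matrix of basic features (12 cepstral coefficients plus short-time energy per frame), it appends regression-style first- and second-order differences. The function name, the window width and the regression formula are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def add_deltas(base_feats, width=2):
    """Append first- and second-order time differences to the basic features.

    base_feats: (T, 13) array of 12 cepstral coefficients + short-time energy
                per frame. Returns a (T, 39) array: base, delta, delta-delta.
    """
    def delta(x):
        # Regression-style difference over a +/- width frame window.
        T = x.shape[0]
        padded = np.pad(x, ((width, width), (0, 0)), mode="edge")
        num = sum(k * (padded[width + k:width + k + T] - padded[width - k:width - k + T])
                  for k in range(1, width + 1))
        den = 2 * sum(k * k for k in range(1, width + 1))
        return num / den

    d1 = delta(base_feats)        # first-order differences
    d2 = delta(d1)                # second-order differences
    return np.concatenate([base_feats, d1, d2], axis=1)

# Example: 200 frames of 13-dim basic features -> (200, 39)
print(add_deltas(np.random.randn(200, 13)).shape)
```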
The present invention is further illustrated below; the particulars are as follows:
1. Adjusting the model state structure with the adaptation speech; the concrete steps are:
Let the state set of the HMMs be Ω, and let the adaptation samples be X = {X_1, ..., X_i, ...} with corresponding state set Φ. Each sample X_i has the feature vector sequence O_i = (o_1, ..., o_t, ..., o_T) and state set Φ_i (Φ_i ⊆ Φ). Using the acoustic model of sample X_i, the frame-synchronous Viterbi algorithm yields the state sequence of O_i over Φ_i, Ξ = (s_1, ..., s_t, ..., s_T), called the actual state sequence; similarly, the Viterbi recognizer yields the state sequence of O_i over the state set Ω, Ψ = (r_1, ..., r_t, ..., r_T), called the recognized state sequence. Comparing these two sequences gives, for the same vector o_t, the pair of states s_t and r_t; if s_t ≠ r_t, r_t is called a confusable state of s_t, and the degree of confusion between them is defined as

C_{s_t|r_t} = \frac{P(o_t \mid r_t)}{P(o_t \mid s_t)}    (1)

Because state s_t has been misrecognized as r_t, when s_t ≠ r_t and the language model and state transition probabilities are ignored, P(o_t \mid r_t) > P(o_t \mid s_t), i.e. C_{s_t|r_t} > 1. From definition (1) it can be seen that the larger C_{s_t|r_t} is, the more likely the actual state s_t is to be recognized as r_t. Therefore, if the Gaussian mixture of state r_t is shared with state s_t in weighted form, changing the structure of s_t, the probability P(o_t \mid s_t) increases, which reduces the misclassification rate of the system and raises the posterior probability of the model on the observation vector o_t.
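A minimal sketch of how the degree of confusion of formula (1) could be collected by comparing the actual (forced-alignment) and recognized Viterbi state sequences, assuming per-frame state log-likelihoods are available; the function and variable names are hypothetical.

```python
import numpy as np

def confusion_degrees(actual_seq, recog_seq, frame_loglik):
    """Collect C_{s|r} = P(o_t|r) / P(o_t|s) for every frame where s_t != r_t.

    actual_seq:   actual state id per frame (sequence Xi)
    recog_seq:    recognized state id per frame (sequence Psi)
    frame_loglik: dict (frame index, state id) -> log P(o_t | state)
    Returns a dict mapping (s, r) pairs to the list of observed C values.
    """
    conf = {}
    for t, (s, r) in enumerate(zip(actual_seq, recog_seq)):
        if s == r:
            continue
        c = np.exp(frame_loglik[(t, r)] - frame_loglik[(t, s)])  # formula (1), log domain
        conf.setdefault((s, r), []).append(c)
    return conf
```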
Let state s ∈ Φ correspond to the observation feature vectors O_s of the adaptation samples, and let R_s be the state set obtained by recognizing O_s (R_s ⊆ Ω); R_s is called the confusable state set of s. The states r (r ∈ R_s) are used to adjust the structure of s, and the adjusted Gaussian mixture function is

b(\cdot \mid s) = \sum_{r \in R_s} w_{s|r} P(\cdot \mid r) + w_0 P(\cdot \mid s)    (2)

In formula (2), w_0 = 1 - D, where D is a constant; the weight w_{s|r} and the probability function P(\cdot \mid r) are computed respectively as

w_{s|r} = \frac{D \cdot C_{s|r}}{\sum_{r \in R_s} C_{s|r}}    (3)

P(\cdot \mid r) = \sum_{l=1}^{L} m_{r,l} N(\cdot \mid \mu_{r,l}, \Sigma_{r,l})    (4)

In formula (4), L is the number of Gaussian mixtures of the state before adjustment, and \mu_{r,l}, \Sigma_{r,l} and m_{r,l} are, respectively, the mean vector, diagonal covariance matrix and weight of the multivariate Gaussian N(\cdot \mid \mu_{r,l}, \Sigma_{r,l}). After the structural adjustment a state therefore carries two layers of weights, the intra-state weights m_{r,k} and the inter-state weights w_{s|r}, which satisfy

Intra-state weights: \sum_{k=1}^{K} m_{r,k} = 1, with 0 ≤ m_{r,k} ≤ 1.

Inter-state weights: \sum_{r \in R_{s'}} w_{s|r} = 1, with 0 ≤ w_{s|r} ≤ 1, where R_{s'} = R_s ∪ {s}.
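The following sketch illustrates formulas (2)-(4): inter-state weights derived from the confusion degrees with a constant D, and the adjusted output density b(·|s) as a weighted combination of the mixture of s and the mixtures of its confusable states. The value D = 0.3, the data layout and the helper names are assumptions made for illustration; SciPy's multivariate normal provides the Gaussian densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def inter_state_weights(conf_degrees, D=0.3):
    """Formula (3): w_{s|r} = D * C_{s|r} / sum_r C_{s|r}, with w_0 = 1 - D.

    conf_degrees: dict r -> C_{s|r} over the confusable set R_s of one state s.
    D is a constant in (0, 1); 0.3 is an illustrative choice, not from the patent.
    """
    total = sum(conf_degrees.values())
    w = {r: D * c / total for r, c in conf_degrees.items()}
    return w, 1.0 - D

def adjusted_state_likelihood(o, state_s, shared_states, w, w0):
    """Formula (2): b(o|s) = sum_r w_{s|r} P(o|r) + w_0 P(o|s).

    state_s / shared_states[r]: lists of (weight m, mean vector, diagonal variances),
    i.e. the mixture components of s and of each confusable state r in R_s.
    """
    def mixture(comps):
        return sum(m * multivariate_normal.pdf(o, mean=mu, cov=np.diag(var))
                   for m, mu, var in comps)

    b = w0 * mixture(state_s)
    for r, comps in shared_states.items():
        b += w[r] * mixture(comps)
    return b
```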
2. Adjusting the model state structure with the training speech; the concrete steps are:

Let \bar{s} denote the state before adjustment; its log-likelihood on O_s is L(O_s)' = \sum_{o \in O_s} \log P(o \mid \bar{s}). The likelihood gain after adjustment is \Delta L(O_s) = L(O_s) - L(O_s)', and the average gain over the state set Φ is

\Delta L = \frac{1}{|\Phi|} \sum_{s \in \Phi} \Delta L(O_s)

\Delta L is used as the threshold in the state-structure adjustment based on the training speech.
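A small sketch of the threshold computation just described, assuming the summed log-likelihoods of O_s before and after adjustment are available per state; the dictionary layout is an assumption.

```python
def likelihood_gain_threshold(loglik_before, loglik_after):
    """Per-state gain Delta L(O_s) = L(O_s) - L(O_s)' and the average gain
    Delta L over Phi, which serves later as the acceptance threshold.

    loglik_before / loglik_after: dicts s -> summed log-likelihood of O_s under
    the state before / after adjustment, for every s in Phi.
    """
    gains = {s: loglik_after[s] - loglik_before[s] for s in loglik_after}
    return gains, sum(gains.values()) / len(gains)
```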
A state set Ψ is defined as Ψ = Ω − Φ, and the training corpus is used to further adjust the model state structure; the concrete steps are:
1) For each training sample Y_i (Y_i ∈ Y) with feature vector sequence O_i, the Viterbi decoding algorithm is applied to obtain the recognized state sequence {η}_i; using the acoustic model corresponding to Y_i, the frame-synchronous Viterbi alignment segments the observation sequence and yields the actual state sequence {γ}_i corresponding to O_i.
2) Step 1) is repeated until all training samples Y have been processed, giving the two classes of state sequences {η} ({η}_i ∈ {η}) and {γ} ({γ}_i ∈ {γ}).
3) {η} and {γ} are compared to determine the confusable state set R_s (R_s ⊆ {η}) of each state s (s ∈ {γ}), and the degree of confusion C_{s|r} between each state r ∈ R_s and state s is computed; the elements of R_s are sorted in descending order of confusion degree, and the size of R_s is denoted I_s.
4) Adjustment of state s: the first i (0 < i ≤ I_s) states are used to adjust s and the likelihood gain \Delta L_s is computed; if \Delta L_s < \Delta L, set i = i + 1 and repeat until \Delta L_s > \Delta L; if \Delta L_s < \Delta L still holds when i = I_s, state s is not adjusted (see the sketch after this list).
5) Steps 3) and 4) are repeated until the structural adjustment of every state in Ψ is complete.
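A sketch of the greedy loop in steps 3)-4): confusable states are added in descending order of confusion degree until the likelihood gain exceeds the threshold ΔL. The callable build_adjusted_loglik is an assumed helper (for example built on formula (2) above) that returns the summed log-likelihood of O_s once the given states are tied to s.

```python
def adjust_state_with_training_data(s, R_s_sorted, loglik_before, avg_delta_L,
                                    build_adjusted_loglik):
    """Greedy adjustment of one state s in Psi (steps 3-4).

    R_s_sorted:            confusable states of s, sorted by C_{s|r} descending.
    loglik_before:         L(O_s)' of the unadjusted state.
    avg_delta_L:           the threshold Delta L from the adaptation-data step.
    build_adjusted_loglik: callable (s, shared) -> summed log-likelihood of O_s
                           with the states in `shared` tied to s (assumed helper).
    Returns the shared subset actually used, or None if s is left unadjusted.
    """
    for i in range(1, len(R_s_sorted) + 1):
        shared = R_s_sorted[:i]                              # top-i confusable states
        delta_L_s = build_adjusted_loglik(s, shared) - loglik_before
        if delta_L_s > avg_delta_L:                          # gain beats the threshold
            return shared
    return None                                              # no adjustment for s
```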
The newly added inter-state weights w_{s|r} are re-estimated with the objective function

L(O_s) = \sum_{o \in O_s} \log P(o \mid s) = \sum_{o \in O_s} \log \sum_{r \in R_{s'}} w_{s|r} P(o \mid r)    (5)

To find the weights w_{s|r} that maximize the objective function, the expectation-maximization (EM) algorithm is used, with the auxiliary function

Q(\bar{w}_{s|r}, w_{s|r}) = E[\log P(O_s, s \mid \bar{w}_{s|r}) \mid O_s, w_{s|r}]    (6)

Under the constraint \sum_{r \in R_{s'}} w_{s|r} = 1, differentiating the above with respect to w_{s|r} gives

\bar{w}_{s|r} = \frac{\sum_{o \in O_s} \sum_{k=1}^{K} \gamma(s, r, k)}{\sum_{o \in O_s} \sum_{r \in R_{s'}} \sum_{k=1}^{K} \gamma(s, r, k)}    (7)

where

\gamma(s, r, k) = \frac{w_{s|r} m_{r,k} N(o \mid \mu_{r,k}, \Sigma_{r,k})}{\sum_{r \in R_{s'}} \sum_{k=1}^{K} w_{s|r} m_{r,k} N(o \mid \mu_{r,k}, \Sigma_{r,k})}

is the probability that an observation o (o ∈ O_s) belongs to the k-th mixture Gaussian of state r, and \bar{w}_{s|r} is the updated value of w_{s|r}.
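A sketch of one EM update of the inter-state weights following formula (7), with γ(s, r, k) computed per frame as defined above; the data layout (a dictionary of mixture components per tied state) is an assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def reestimate_inter_state_weights(obs, tied, w):
    """One EM update of w_{s|r}, formula (7).

    obs:  iterable of observation vectors O_s aligned to state s.
    tied: dict r -> list of (m_{r,k}, mean, diagonal variances) for every r in R_s'
          (the confusable set plus s itself).
    w:    dict r -> current w_{s|r}, summing to 1 over R_s'.
    Returns the updated weights w_bar.
    """
    num = {r: 0.0 for r in tied}
    for o in obs:
        # gamma(s, r, k): posterior of mixture k of tied state r given o
        gam = {r: [w[r] * m * multivariate_normal.pdf(o, mean=mu, cov=np.diag(v))
                   for m, mu, v in comps]
               for r, comps in tied.items()}
        denom = sum(sum(g) for g in gam.values())
        for r in tied:
            num[r] += sum(gam[r]) / denom       # accumulate sum_k gamma(s, r, k)
    total = sum(num.values())
    return {r: num[r] / total for r in tied}    # normalize as in formula (7)
```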
When the MLLR algorithm is used to adapt the state-adjusted model, in view of the limited adaptation corpus only the means of the model are adapted and the remaining parameters are kept unchanged. The transformation matrix in the MLLR algorithm is a diagonal transformation matrix, shared among different target means; the diagonal transformation matrix is estimated from all adaptation data corresponding to the target distributions that share it, and the degree and range of sharing are adjusted according to the amount of adaptation data and the phonetic classification.
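For the diagonal MLLR mean adaptation described above, the following sketch estimates a shared per-dimension transform μ̂ = a·μ + b by solving the 2×2 normal equations accumulated from occupation-weighted statistics, and then applies it to every tied mean. This is a generic diagonal-MLLR sketch under the usual diagonal-covariance assumption, not the patent's exact estimation procedure; the names and data layout are illustrative.

```python
import numpy as np

def estimate_diagonal_mllr(stats):
    """Estimate a shared diagonal MLLR mean transform mu_hat = a * mu + b.

    stats: list of (gamma, o, mu, var) tuples, one per (frame, Gaussian) pair that
           shares this transform: occupation probability gamma (scalar), observation
           o, Gaussian mean mu and diagonal variances var (1-D arrays of dim D).
    Returns (a, b), each of shape (D,).
    """
    D = stats[0][1].shape[0]
    a, b = np.zeros(D), np.zeros(D)
    for d in range(D):
        G = np.zeros((2, 2))
        k = np.zeros(2)
        for gamma, o, mu, var in stats:
            wgt = gamma / var[d]
            G += wgt * np.array([[mu[d] * mu[d], mu[d]],
                                 [mu[d],         1.0]])
            k += wgt * np.array([o[d] * mu[d], o[d]])
        a[d], b[d] = np.linalg.solve(G, k)      # per-dimension normal equations
    return a, b

def adapt_means(means, a, b):
    """Apply the shared transform to every Gaussian mean tied to it."""
    return [a * mu + b for mu in means]
```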
The present invention shares Gaussian mixture functions between easily confused states. The recognition errors caused by the structural mismatch between the training-speech and test-speech decision trees show up as state confusions when the adaptation speech is recognized. For example, when male speech is recognized with a female-voice model and state A is recognized as state B (B ≠ A), in most cases A and B belong to the same decision tree, and in some cases A and B even belong to the same leaf node of the male-voice decision tree. Therefore, the present invention first adjusts the state structure with the adaptation speech and then, on this basis, uses the training speech to extend the range of states that are adjusted.
The present invention raises the posterior probability of the model on the samples, improves the utilization of the adaptation corpus, increases the number of parameters within each state, and enlarges the descriptive power of the model, while keeping the growth of the total number of system parameters limited, thereby reducing the loss of recognition rate caused by the structural mismatch between the training-corpus and test-corpus decision trees. It should be noted that the scope of protection of the present invention is not limited by the size or number of the modeling units, nor by the type of model; the method is applicable to any other continuous speech recognition system.
Description of drawings
Fig. 1: State-structure adjustment and speaker adaptation
Fig. 2: State-structure adjustment based on the training corpus
Fig. 3: Performance comparison of the systems with state-structure adjustment
Fig. 4: Speaker-adaptation performance comparison of the systems with state-structure adjustment
Embodiment
The following embodiment is provided in conjunction with the content of the method of the present invention to aid further understanding.
Embodiment:
To better understand the technical scheme of the present invention, experiments on a continuous speech database are used for further illustration. The training set F_Tr of the baseline system F_863 contains recordings of 68 female speakers, about 530 sentences per speaker and 36210 sentences in total. The speech is sampled at 16 kHz with 16-bit quantization; the frame length is 25 ms and the frame shift is 10 ms. 39-dimensional speech feature vectors are extracted, comprising 12 MFCCs, 1 normalized energy, and their first- and second-order differences. The acoustic model takes initials and tonal finals as the basic modeling units, each represented by a continuous-density HMM. In the present invention the basic modeling units are listed in Table 1 (the digits after a tonal final denote the tone, with 5 denoting the neutral tone): 27 initials, among which ga, ge, ger and go are hypothesized initials for the isolated syllables a, e, er and o; and 157 tonal finals, where ib denotes the final in the syllables chi, ri, shi and zhi, and if denotes the final used in the syllables ci, si and zi. Together with a silence HMM, 185 monophone models are trained in total, following the general speech recognition training procedure. After training, the models are expanded from monophones to triphones and the triphone models are state-clustered with an acoustic decision tree; the clustered distributions are gradually expanded from single Gaussians to 8-mixture Gaussians. No language model is applied during recognition, so the experimental results are at the acoustic level only.
Table 1. Initials and tonal finals in the acoustic model

Initials (27): b, c, ch, d, f, g, ga, ge, ger, go, h, j, k, l, m, n, p, q, r, s, sh, t, w, x, y, z, zh

Tonal finals (157): a(1-5), ai(1-4), an(1-4), ang(1-5), ao(1-4), e(1-5), ei(1-4), en(1-5), eng(1-4), er(2-4), i(1-5), ia(1-4), ib(1-4), ian(1-5), iang(1-4), iao(1-4), ie(1-4), if(1-4), in(1-4), ing(1-4), iong(1-3), iu(1-5), o(1-5), ong(1-4), ou(1-5), u(1-5), ua(1-4), uai(1-4), uan(1-4), uang(1-4), ui(1-4), un(1-4), uo(1-5), v(1-4), van(1-4), ve(1-4), vn(1-4)
The male-voice test corpus M_Te comes from 14 speakers, 40 sentences each; the male-voice adaptation corpus M_Ad comes from the same 14 speakers, also 40 sentences each, with the test speech and the adaptation speech kept independent. The model obtained by adjusting the state structure of F_863 with M_Ad is denoted R1_F, and the model further adjusted with F_Tr on the basis of R1_F is denoted R2_F. Figure 3 compares system performance as the number of adaptation sentences varies. As can be seen from Figure 3, R1_F and R2_F achieve consistently higher recognition rates than F_863. When the adaptation corpus is small, for example only 1 or 3 sentences, the number of states whose structure is adjusted in R1_F is limited and so is its performance gain, whereas R2_F, which uses the training corpus to adjust states that do not appear in the adaptation speech, shows a marked improvement; this confirms the assumption made when using the training corpus to adjust the state structure. As the number of adaptation sentences increases, the performance of R1_F and R2_F converges, and with sufficient adaptation data the two become identical.
MLLR speaker adaptation with the male-voice adaptation speech is then applied to the three systems F_863, R1_F and R2_F; Figure 4 shows the recognition rates of F_863/MLLR, F_R1/MLLR and F_R2/MLLR as the number of adaptation sentences varies. A system with more parameters benefits markedly from MLLR adaptation. Compared with the F_863 system, the state-adjusted F_R1 and F_R2 systems not only greatly increase the number of parameters within each state but also, by adjusting the state structure, indirectly adjust the decision tree structure, reducing the effect of the mismatch between the decision tree structure and the test speech on speaker adaptation. The recognition performance of F_R1/MLLR and F_R2/MLLR is therefore clearly higher than that of F_863/MLLR, which proves that the state adjustment algorithm helps to improve system performance.

Claims (2)

1. A state-structure adjustment method in speech recognition, characterized in that, according to the degree of confusion between states, the state structure is adjusted by weighted sharing of Gaussians between confusable states, with the following concrete steps:
(1) Build a large-vocabulary continuous speech recognition system: the speech features use 12th-order Mel cepstral coefficients plus short-time energy, 13 dimensions in total, as the basic features; their first-order and second-order differences are appended, giving a final feature dimension of 39; the procedure follows general speech recognition practice; the features of every training utterance are extracted, and the HTK tools are used to first select initials and tonal finals as the basic modeling units according to the sentence content and build tonal monophone models; the models are then expanded from monophones to context-dependent triphone models, which take into account the left and right initials/finals across syllables, with different contexts corresponding to different triphone models; finally, an acoustic decision tree is used to cluster the states of all triphone models derived from the same monophone, and the clustered states are gradually expanded from single Gaussian distributions to multiple-mixture Gaussian distributions;
(2) State-structure adjustment: this comprises adjusting the model state structure with the adaptation speech and adjusting it with the training speech; the adaptation speech and the test speech come from the same speakers, so errors that the baseline system makes when recognizing the adaptation speech will likewise occur when it recognizes the test speech; it is further assumed that errors the baseline system makes when recognizing the training speech will also occur when it recognizes the test speech, so the training corpus is used to adjust the structure of states that do not appear in the adaptation speech;
(3) Speaker adaptation: the maximum likelihood linear regression algorithm is applied, using the adaptation corpus to adapt the adjusted model.
2. The state-structure adjustment method in speech recognition according to claim 1, characterized in that, when the maximum likelihood linear regression algorithm adapts the state-adjusted model, in view of the limited adaptation corpus only the means of the model are adapted; the transformation matrix in the maximum likelihood linear regression algorithm is a diagonal transformation matrix, shared among two or more target means; the diagonal transformation matrix is estimated from all adaptation data corresponding to the target distributions that share it, and the degree and range of sharing are adjusted according to the amount of adaptation data and the phonetic classification.
CNB2004100667929A 2004-09-29 2004-09-29 State structure regulating method in sound identification Expired - Fee Related CN1295676C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100667929A CN1295676C (en) 2004-09-29 2004-09-29 State structure regulating method in sound identification


Publications (2)

Publication Number Publication Date
CN1588536A CN1588536A (en) 2005-03-02
CN1295676C true CN1295676C (en) 2007-01-17

Family

ID=34604094

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100667929A Expired - Fee Related CN1295676C (en) 2004-09-29 2004-09-29 State structure regulating method in sound identification

Country Status (1)

Country Link
CN (1) CN1295676C (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315733B (en) * 2008-07-17 2010-06-02 安徽科大讯飞信息科技股份有限公司 Self-adapting method aiming at computer language learning system pronunciation evaluation
CN101604522B (en) * 2009-07-16 2011-09-28 北京森博克智能科技有限公司 Embedded Chinese-English mixed voice recognition method and system for non-specific people
CN102237082B (en) * 2010-05-05 2015-04-01 三星电子株式会社 Self-adaption method of speech recognition system
CN103117060B (en) * 2013-01-18 2015-10-28 中国科学院声学研究所 For modeling method, the modeling of the acoustic model of speech recognition
CN104157294B (en) * 2014-08-27 2017-08-11 中国农业科学院农业信息研究所 A kind of Robust speech recognition method of market for farm products element information collection
CN106898355B (en) * 2017-01-17 2020-04-14 北京华控智加科技有限公司 Speaker identification method based on secondary modeling
CN110148403B (en) * 2019-05-21 2021-04-13 腾讯科技(深圳)有限公司 Decoding network generation method, voice recognition method, device, equipment and medium
CN112927716A (en) * 2021-01-22 2021-06-08 华东交通大学 Construction site special vehicle identification method based on improved MFCC

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5596680A (en) * 1992-12-31 1997-01-21 Apple Computer, Inc. Method and apparatus for detecting speech activity using cepstrum vectors
US6266636B1 (en) * 1997-03-13 2001-07-24 Canon Kabushiki Kaisha Single distribution and mixed distribution model conversion in speech recognition method, apparatus, and computer readable medium
CN1346126A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Three-tone model with tune and training method
CN1499481A (en) * 2002-10-24 2004-05-26 杜和平 'Ewenke' musical instrument


Also Published As

Publication number Publication date
CN1588536A (en) 2005-03-02


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070117

Termination date: 20091029