CN1655232A - Context-sensitive Chinese speech recognition modeling method - Google Patents



Publication number
CN1655232A
Authority
CN
China
Prior art keywords
model
gauss
chinese
state
modeling method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2004100041313A
Other languages
Chinese (zh)
Other versions
CN1655232B (en)
Inventor
贾磊
马龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to CN2004100041313A priority Critical patent/CN1655232B/en
Publication of CN1655232A publication Critical patent/CN1655232A/en
Application granted granted Critical
Publication of CN1655232B publication Critical patent/CN1655232B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

This invention relates to a context-dependent Chinese speech recognition modeling method in which initials are right-context dependent and finals are left-context dependent. The method comprises: (a) creating context-dependent basic modeling units by associating each initial with the final immediately to its right and each final with the initial immediately to its left; (b) training the model parameters with a state-clustering method to obtain an initial hidden Markov model (HMM); and (c) compressing the HMM with a subspace clustering method to produce the final model.

Description

Context-dependent Chinese speech recognition modeling method
Technical field
The present invention relates to a speech recognition modeling method, and in particular to a context-dependent Chinese acoustic modeling method applicable to embedded devices.
Background art
Speech recognition is the technology of converting a speech signal into the corresponding text or command through processes of recognition and understanding. Combined with speech synthesis, it lets people dispense with the keyboard, operate machines by voice command, and converse with them by speech. Over the past two decades, with the rapid development of computer technology, speech recognition has made marked progress and begun to move from the laboratory to the market. It is expected that within the next ten years speech recognition technology will enter fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics.
At the present stage, however, as speech recognition moves toward practical application, there remains a tension between a computer's computing power and storage capacity on the one hand and the recognition accuracy of the system on the other. How to build high-precision acoustic models on memory-limited embedded devices is a key technical problem for the practical deployment of speech recognition systems.
Chinese patent publication CN1264468A discloses a computer-implemented dictation system that converts speech input into text. The system uses a text-to-phone structure to produce a spoken rendition of a given word and outputs that rendition through an audio device, so that the user of the speech recognition system knows how the recognizer expects the word to be pronounced.
Chinese patent publication CN1288225A discloses a speech recognition system and a speech-recognition-based control method. In this scheme, an operator's expected utterances are prerecorded and stored in a speech recognition table. When an unverified terminal electronic device is connected to the control device, the control device stores the speech recognition table supplied by that device; when the operator speaks, the control device compares the operator's voice against the stored speech recognition table and controls the input/output of the electronic device according to the comparison result.
The speech recognition system disclosed in CN1264468A adopts a context-dependent phoneme modeling method. Although the acoustic model built this way has high accuracy, it is relatively large, is difficult to load directly into the memory of an embedded device, and so cannot readily meet the practical needs of embedded applications.
The common problem with the above publications is that they require a large amount of memory and are unsuitable for use in embedded devices.
Summary of the invention
In view of the characteristics of Chinese, the present invention proposes a context-dependent phone modeling method based on state clustering, in which initials are right-context dependent and finals are left-context dependent. An acoustic model trained with this method combines high accuracy with a small footprint, making it particularly suitable for embedded devices with limited memory. To compress the acoustic model further while losing as little accuracy as possible, so as to preserve recognition performance on embedded devices, the invention compresses the acoustic model with a subspace clustering algorithm. With essentially no loss of recognition performance, this method can compress the acoustic model of a speech recognition system to 1/10 to 1/5 of its original size.
An object of the invention is to provide a context-dependent Chinese acoustic modeling method suitable for embedded devices. The method minimizes the likelihood loss over all training samples during initial model training and requires no speech corpus at all during model compression, so the model can be scaled down quickly and conveniently without a significant loss of accuracy.
The invention provides a context-dependent Chinese acoustic modeling method suitable for embedded devices, comprising the steps of: (a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units; (b) training the model parameters with a state-clustering method to obtain an initial hidden Markov model (HMM); and (c) compressing the initial HMM with a subspace clustering algorithm to produce the final model.
In another aspect, the invention provides a computer-readable recording medium storing a program that performs the context-dependent Chinese speech recognition modeling method: (a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units; (b) training the model parameters with a state-clustering method to obtain an initial hidden Markov model (HMM); and (c) compressing the initial HMM with a subspace clustering algorithm to produce the final model.
In short, targeting the pronunciation characteristics of Chinese, the invention proposes a diphone (semi-syllable) modeling method with right-dependent initials and left-dependent finals; it shares state output distributions and trains model parameters by state clustering, taking the minimal loss of likelihood over all training samples as the objective; and it compresses the acoustic model by subspace clustering. For this compression task, the LBG algorithm first clusters the original Gaussians to generate initial Gaussian codebooks, and a K-means clustering algorithm then refines the codebooks into their final form. A major advantage of the method is that no speech corpus is needed at any point during model compression.
Description of drawings
The features and advantages of the present invention will become clearer from the following description read with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of model training in Chinese acoustic modeling according to an embodiment of the invention;
Fig. 2 shows a commonly used hidden Markov model topology;
Fig. 3 is a flowchart of the state-clustering-based output-distribution sharing algorithm according to an embodiment of the invention;
Fig. 4 illustrates output-distribution sharing by hidden Markov state clustering; and
Fig. 5 is a flowchart of the subspace-clustering-based model compression according to the invention.
Embodiment
The basic principles of speech recognition are first described below.
Speech recognition involves two basic processes: training and recognition. The main task of training is to use a large number of speech samples to build an acoustic model that captures acoustic-level knowledge. A complex recognition system also uses a large text corpus to train a language model that captures language-level knowledge. During recognition, the acoustic model and language model obtained in training are used to decode the speech samples to be recognized into text. The technical innovations described in this patent concern mainly the acoustic model training stage.
As a language, Chinese has its own distinctive characteristics; exploiting them in acoustic modeling can maximize model performance while reducing model size.
Consider a comparison between Chinese and English, a typical Western language. The most striking difference is that Chinese is written ideographically while English is written alphabetically. The smallest linguistic unit of English is the word; words are continually coined and their number keeps changing, since new things and concepts usually call for new words. In pronunciation, an English word is formed by linking several syllables together, and the coupling between syllables is very strong. The smallest building block of Chinese, by contrast, is the character; characters, alone or combined into words, describe different things and concepts. The character is thus the basic and relatively independent building unit of Chinese, a notion that English lacks. In pronunciation, every Chinese character is an independent syllable, and each syllable consists of an initial followed by a final. The pronunciations of all Chinese characters form 408 distinct syllables. Because characters combine relatively independently in word formation, the syllables of Chinese speech are also relatively independent of one another. The present invention exploits precisely this inter-syllable independence of Chinese to build high-precision acoustic models.
Fig. 1 is an overview flowchart of the acoustic model training process. First, at step S11, the basic modeling units are selected and the contextual coupling between them is defined. Then, at step S12, the parameters of the hidden Markov models are trained on the speech training data using the state-clustering method, yielding the initial HMMs. Finally, at step S13, the initial models are compressed with the subspace clustering algorithm to obtain the final model.
Each step of the flowchart of Fig. 1 is described in detail below.
1. Select the basic modeling unit
Before acoustic model training begins, the basic modeling unit, i.e., the granularity of each model, must be defined. Speech recognition offers several choices of basic unit: phonemes, semi-syllables, syllables, or words. As noted above, each Chinese character is a syllable composed of an initial and a final, so most Chinese speech recognition systems take the initial/final pair as the basic modeling unit; this is called semi-syllable modeling. We likewise choose initials and finals as the basic modeling units: 27 initials, 38 finals, and one additional silence unit.
2. Define the context dependence between basic modeling units
"Context dependence" means that in continuous speech the pronunciation of each basic unit depends not only on the unit itself but also on its linguistic context, i.e., on the pronunciations of neighbouring units. For example, in the two words "China" (zh-ong g-uo) and "central" (zh-ong y-ang), the same basic unit (zh-)ong is pronounced differently because the following guo and yang are pronounced differently.
The characteristics of Chinese analysed above point to the relative independence between syllables in Chinese word formation. The invention uses exactly this inter-syllable independence to define the context dependence of the basic modeling units. We assume that the pronunciations of different syllables are mutually independent: the pronunciation of an initial depends only on the final to its right within the same syllable, and the pronunciation of a final depends only on the initial to its left within the same syllable.
This definition of context dependence matches the pronunciation characteristics of Chinese: it captures the coarticulation within a syllable quite accurately, and, because only single-sided context is considered, it greatly reduces the number of models. A simple calculation shows that if the contexts on both sides were considered (i.e., triphone modeling), the total number of models would be 27 × 39 × 39 + 38 × 28 × 28 + 1 = 70860, whereas with our dependence definition (diphone modeling) it drops to 27 × 38 + 38 × 27 + 1 = 2053. This sharp reduction relieves the pressure on the subsequent training and compression steps and makes the model far better suited to embedded systems.
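As a quick sanity check, the model counts above can be reproduced in a few lines of Python. This is an illustrative sketch; the factors 39 and 28 are assumed here to mean the 38 finals plus one boundary context and the 27 initials plus one boundary context, as the figures imply:

```python
# Context-dependent model counts for Mandarin, per the text:
# 27 initials, 38 finals, plus one silence model.
INITIALS, FINALS = 27, 38

# Triphone: each unit conditioned on both neighbours; the text uses
# 39 contexts per side for initials and 28 per side for finals.
triphone = INITIALS * 39 * 39 + FINALS * 28 * 28 + 1

# Diphone: an initial conditioned only on its right final,
# a final conditioned only on its left initial.
diphone = INITIALS * FINALS + FINALS * INITIALS + 1

print(triphone, diphone)  # 70860 2053
```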
3. Train the model parameters by state clustering, using likelihood loss as the distance criterion
The hidden Markov model (HMM) is the mainstream modeling approach in speech recognition. Fig. 2 shows the commonly used left-to-right HMM topology. The states are arranged from left to right; transitions follow the arrows, either self-loops or jumps to other states, each with a certain probability, and each state emits observations according to an attached probability density function (pdf).
Estimating the parameters of every Markov model directly from the available training data is difficult, and a model trained that way would have so many parameters that its memory footprint could not meet the needs of embedded devices. State clustering is therefore used to share output-distribution parameters among the states of different models, which both reduces the number of model parameters and ensures robust estimates for the shared output distributions.
The state clustering algorithm proceeds as follows.
First, the state output distributions are initialized. Each HMM is assumed to comprise three states, corresponding respectively to the initial, middle, and final segments of the speech observation samples of its basic modeling unit. The samples aligned to a given state form that state's feature space. Initially, each HMM state space is described by a two-component Gaussian mixture model; for states with fewer observation samples than a certain threshold, the feature space is described by a fixed-variance Gaussian model.
Next, the model state output distributions are shared. First define the context-dependent phone set (all-phone) of a given basic modeling unit: for a specific unit, all the HMMs produced by its different contexts are collectively called the context-dependent phone models of that unit. For the initial b, for instance, b-a, b-an, b-o, b-u, and so on are the context-dependent phone models of b. During state clustering, only states occupying the same position within the context-dependent phone models of the same unit are clustered together.
The state-clustering-based output-distribution sharing procedure of the invention is described below with reference to Fig. 3.
At step S31, the likelihood loss caused by merging any two states is computed. The likelihood loss is calculated with formula (1):
$$\mathrm{Dis}=\sum_{k\in C_1}\log P_1(o_k)+\sum_{k\in C_2}\log P_2(o_k)-\sum_{k\in C}\log P(o_k)\qquad(1)$$
As stated above, the feature space of each state is described by a two-component Gaussian mixture; P(o_k) denotes the observation probability density function on that feature space, and o_k is an input observation sample vector. C_1 and C_2 denote the two state classes before merging, and C denotes the state class generated by merging C_1 and C_2.
At step S32, among all the possible state merges computed at step S31, the merge of the two state classes with the minimum likelihood loss is selected. Then, at step S33, it is checked whether the sample counts of both state classes exceed a fixed threshold M. If so, the flow moves to step S34, this merge is removed from the candidate set, and the flow returns to step S32. If not, i.e., at least one of the two state classes has fewer samples than the threshold, the two classes are merged at step S35 into a new state class, whose feature space is again described by a two-component Gaussian mixture. Step S36 then checks whether every state class has more observation samples than another fixed threshold N. If so, the flow proceeds to step S37, where the K-means clustering algorithm estimates the parameters of the mixture output distribution of each merged state. Otherwise, i.e., at least one state class has no more than N samples, the flow returns to step S31 and continues computing the pairwise merge losses.
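The S31–S37 loop amounts to a greedy bottom-up merge driven by the likelihood loss of formula (1). The sketch below is a simplified, hypothetical illustration: it uses one-dimensional single-Gaussian states summarized by sufficient statistics (count, mean, variance), for which the total log-likelihood under the MLE fit has a closed form, and it omits the threshold tests M and N of steps S33 and S36; the patent itself uses two-component mixtures.

```python
import math

def gauss_loglik(n, mean, var):
    """Total log-likelihood of n samples under the 1-D Gaussian that is
    MLE-fitted to them: sum_k log N(x_k) = -n/2 * (log(2*pi*var) + 1)."""
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def merge_stats(s1, s2):
    """Pool two (count, mean, variance) summaries into one."""
    n1, m1, v1 = s1
    n2, m2, v2 = s2
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * (v1 + (m1 - m) ** 2) + n2 * (v2 + (m2 - m) ** 2)) / n
    return n, m, v

def merge_loss(s1, s2):
    """Formula (1): likelihood lost by modelling the pooled class C
    with one Gaussian instead of C1 and C2 separately (always >= 0)."""
    return (gauss_loglik(*s1) + gauss_loglik(*s2)
            - gauss_loglik(*merge_stats(s1, s2)))

# Greedy bottom-up pass: repeatedly merge the cheapest pair of states.
states = {"a": (100, 0.0, 1.0), "b": (100, 0.1, 1.0), "c": (100, 5.0, 1.0)}
while len(states) > 2:
    pairs = [(merge_loss(states[i], states[j]), i, j)
             for i in states for j in states if i < j]
    loss, i, j = min(pairs)
    states[i + j] = merge_stats(states.pop(i), states.pop(j))
print(sorted(states))  # the two close states merge first: ['ab', 'c']
```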
Fig. 4 illustrates merged state classes whose feature spaces share the same output distribution. In the state clustering process above, the number of Gaussian components in each mixture can either be fixed in advance or be determined dynamically by some criterion (for example, the BIC criterion).
4. Compress the acoustic model by subspace clustering
The acoustic model trained by state clustering is usually still fairly large and remains difficult to use directly in an embedded system. The invention therefore also includes a model compression technique based on subspace clustering, which can compress the acoustic model to 1/10 to 1/5 of its original size with essentially no loss of recognition performance. The technique is described in detail below with reference to Fig. 5.
First, the subspaces are defined at step S51. In speech recognition, both recognition and training extract characteristic parameters, normally multidimensional vectors, from speech; all possible values of these feature vectors constitute the original multidimensional feature space. A "subspace" is the feature space generated by grouping together the feature dimensions that are most correlated. Subspaces may be delimited manually by rule of thumb, or the correlation coefficients between dimensions may be computed on the original feature space and the most correlated dimensions grouped into a subspace. Defining the subspaces divides the original feature space into several feature subspaces.
After the subspaces are defined, the original Gaussians are decomposed at step S52. Each Gaussian of a mixture built on the original feature space can be expressed as the product of sub-Gaussians built on the subspaces:
$$P(O)=\sum_{m=1}^{M}c_m\prod_{k=1}^{K}N_{\mathrm{Tied}}(O_k;\mu_{mk},\sigma_{mk})\qquad(2)$$
Formula (2) gives the mixture score P(O) for an observation vector O of the original feature space after the space has been decomposed into subspaces. Here M is the number of mixture components, K is the number of subspaces, N_Tied(O_k; μ_mk, σ_mk) is the sub-Gaussian built on the k-th subspace, O_k is the sub-vector of the observation O corresponding to the k-th subspace, and {μ_mk, σ_mk} are the parameters of the original Gaussian restricted to the k-th subspace.
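Under the decomposition of formula (2), a model score can be assembled from per-subspace codebooks of tied sub-Gaussians. The sketch below is an illustrative Python rendering with diagonal covariances and invented data structures (codebook lists plus per-mixture index tables), not the patent's implementation:

```python
import math

def diag_gauss(x, mu, var):
    """Density of a diagonal-covariance Gaussian at x."""
    e = sum((xi - mi) ** 2 / vi + math.log(2 * math.pi * vi)
            for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * e)

def score(o, weights, codebooks, indices, subspaces):
    """Formula (2): P(O) = sum_m c_m * prod_k N_Tied(O_k; mu_mk, sigma_mk).
    Each mixture component m stores, per subspace k, only an index into
    that subspace's shared codebook of sub-Gaussians."""
    total = 0.0
    for m, c_m in enumerate(weights):
        p = 1.0
        for k, dims in enumerate(subspaces):
            o_k = [o[d] for d in dims]              # observation sub-vector O_k
            mu, var = codebooks[k][indices[m][k]]   # tied sub-Gaussian params
            p *= diag_gauss(o_k, mu, var)
        total += c_m * p
    return total

# Sanity check: with codebook entries equal to the full Gaussian's
# marginals, the subspace product recovers the full diagonal density.
o = [0.5, -0.5, 1.0, 0.0]
cb = [[([0.0, 0.0], [1.0, 1.0])], [([0.0, 0.0], [1.0, 1.0])]]
p_sub = score(o, [1.0], cb, [[0, 0]], [(0, 1), (2, 3)])
p_full = diag_gauss(o, [0.0] * 4, [1.0] * 4)
print(abs(p_sub - p_full) < 1e-12)  # True
```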
The flow then proceeds to step S53: on each subspace, all the sub-Gaussians obtained by the decomposition are clustered. Clustering produces a number of Gaussian codebooks on each subspace, and the recognizer's final acoustic model is assembled from these codebooks. The clustering of the sub-Gaussians consists of two stages: initializing the Gaussian codebooks and then refining them.
At step S53, the Gaussian codebooks are initialized. The LBG algorithm clusters all the original sub-Gaussians on a subspace to generate an initial codebook. Concretely, a codebook centre is first created: its mean is estimated directly as the centre of the means of all the sub-Gaussians, and its variance as the average of their variances. After the codebook centre is initialized, a cycle of "binary split - cluster - codebook update" is repeated. In the binary split, each existing codeword is perturbed into two new codewords. All training samples are then re-clustered against the split codebook, and the centre of each class is computed to give the updated codebook. The split-cluster-update cycle repeats until the codebook reaches its predetermined size. During clustering, when a sub-Gaussian is reassigned, the following formula is used as the distance criterion, measuring the distance from a sub-Gaussian to a codebook centre; each sub-Gaussian is assigned to the class represented by its nearest Gaussian centre.
$$D_{\mathrm{bhat}}=\frac{1}{8}(\mu_2-\mu_1)^T\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mu_2-\mu_1)+\frac{1}{2}\ln\frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}\qquad(3)$$
Formula (3) gives the distance between the Gaussians N(μ_1, Σ_1) and N(μ_2, Σ_2).
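For diagonal covariances, formula (3) reduces to a sum of per-dimension terms, since the determinants and the matrix inverse factor across dimensions. A minimal illustrative version (not the patent's code):

```python
import math

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance of formula (3) between two
    diagonal-covariance Gaussians, computed dimension by dimension."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                      # averaged variance
        d += 0.125 * (m2 - m1) ** 2 / v          # mean-separation term
        d += 0.5 * math.log(v / math.sqrt(v1 * v2))  # variance-mismatch term
    return d

# Identical Gaussians are at distance 0; distance grows with the mean gap.
print(bhattacharyya_diag([0.0], [1.0], [0.0], [1.0]))  # 0.0
print(bhattacharyya_diag([0.0], [1.0], [2.0], [1.0]))  # 0.5
```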
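With a distance criterion in hand, the split-cluster-update cycle of the LBG initialization at step S53 can be sketched. The illustration below is a deliberately simplified one-dimensional version: it clusters scalar points with absolute-value distance and an additive perturbation, standing in for the patent's clustering of sub-Gaussian models under distance (3).

```python
import random

def lbg(points, target, eps=0.5, iters=10):
    """LBG codebook initialization: start from the global mean, then
    repeat binary split (perturbation) -> re-cluster -> update centres
    until the codebook holds `target` codewords (a power of 2 here)."""
    code = [sum(points) / len(points)]                    # global centre
    while len(code) < target:
        code = [c + s for c in code for s in (eps, -eps)]  # binary split
        for _ in range(iters):                             # Lloyd refinement
            cells = [[] for _ in code]
            for p in points:
                i = min(range(len(code)), key=lambda i: abs(p - code[i]))
                cells[i].append(p)
            # Update each codeword to its cell mean; keep it if the cell is empty.
            code = [sum(v) / len(v) if v else c for v, c in zip(cells, code)]
    return sorted(code)

random.seed(0)
data = [random.gauss(-4, 0.1) for _ in range(50)] + \
       [random.gauss(4, 0.1) for _ in range(50)]
codebook = lbg(data, 2)
print(codebook)  # two codewords, near -4 and +4
```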
After the Gaussian codebooks are initialized, they are further refined at step S54: a generalized K-means clustering algorithm re-optimizes the codebook of each subspace. During optimization the sub-Gaussians are repartitioned according to formula (3), and the Gaussian codebook centres are updated according to formulas (4) and (5):
$$\mu=\frac{n_1\mu_1+n_2\mu_2}{n_1+n_2}\qquad(4)$$
$$\Sigma=\frac{n_1(\mu_1-\mu)(\mu_1-\mu)^T+n_2(\mu_2-\mu)(\mu_2-\mu)^T}{n_1+n_2}+\frac{n_1\Sigma_1+n_2\Sigma_2}{n_1+n_2}\qquad(5)$$
Here {μ, Σ} are the mean and variance parameters of the updated Gaussian, {μ_1, Σ_1} and {μ_2, Σ_2} are the parameters of the two sub-Gaussians participating in the cluster, and n_1 and n_2 are the sample counts used when those two Gaussian distributions were estimated.
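Formulas (4) and (5) amount to pooling two Gaussians by their sample counts: the merged mean is the count-weighted mean, and the merged variance is the within-class average plus the between-class spread. An illustrative per-dimension (diagonal-covariance) version:

```python
def merge_gaussians(n1, mu1, var1, n2, mu2, var2):
    """Codebook-centre update of formulas (4) and (5), per dimension."""
    n = n1 + n2
    # Formula (4): count-weighted mean.
    mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]
    # Formula (5): between-class term plus pooled within-class variance.
    var = [(n1 * ((a - m) ** 2 + va) + n2 * ((b - m) ** 2 + vb)) / n
           for a, b, m, va, vb in zip(mu1, mu2, mu, var1, var2)]
    return n, mu, var

# Merging two unit-variance Gaussians centred at 0 and 2 (equal counts)
# gives mean 1 and variance 1 + 1 = 2.
print(merge_gaussians(10, [0.0], [1.0], 10, [2.0], [1.0]))
# (20, [1.0], [2.0])
```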
It should be noted that no raw speech training samples are involved at any stage of model compression; compared with earlier subspace model compression methods, the method proposed here is therefore more flexible and convenient to use.
The invention is intended chiefly for speech recognition systems on embedded devices. Its main benefit is that it preserves the recognition performance of the original system while shrinking the acoustic model, making a high-precision speech recognition system feasible even on embedded devices with very little memory.
The performance of the model training method of the invention is compared below with that of common modeling algorithms, to demonstrate the effectiveness of the invention.
1) Continuous-speech syllable recognition
The main purpose of this test is to compare the accuracy of the model trained with the acoustic modeling method of this patent against a model trained with a common acoustic modeling method.
The training data are recordings of 83 male speakers from the 863 training set. The test data are the standard male-voice test utterances of 240 words from the 863 test set. The acoustic models compared are the diphone model trained with the method of the invention and a triphone model trained with the common decision-tree-based method. The results are as follows:
Table 1. Model accuracy of the diphone and triphone models in continuous syllable recognition

  Model                       Continuous-syllable recognition rate
  Diphone (this invention)    80.24%
  Common triphone             82.02%
2) Name recognition on a mobile phone
The main purpose of this test is to measure, on a name recognition task, the recognition performance of acoustic models trained with the proposed training algorithm. The models compared are the compressed diphone model trained with the method of this patent and a triphone model trained with the common decision-tree-based method. The training data are still the recordings of 83 male speakers from the 863 training set; the test data are 2500 names recorded by 10 speakers in a laboratory environment.
Table 2. Model accuracy of the diphone and triphone models in name recognition

  Model                       Name recognition rate
  Diphone (this invention)    96.90%
  Common triphone             97.00%
3) Model size comparison
Because memory is very scarce in embedded devices, model size is, alongside raw accuracy, an important part of any overall measure of system performance.
When producing the results of Table 2, the memory occupied by the diphone model of the invention was 1/10 that of the commonly used triphone model.
4) Effect of state clustering in building the diphone model
When building diphone models, the state-clustering-based diphone model achieves higher accuracy than the decision-tree-based one. Table 3 gives the continuous-syllable recognition rates of the two diphone systems:
Table 3. Model accuracy of the state-clustering-based and decision-tree-based diphones

  Model                               Continuous-syllable recognition rate
  Diphone based on state clustering   80.24%
  Diphone based on decision tree      79.01%
5) Memory reduction contributed by the subspace-clustering-based model compression
If every two feature dimensions are grouped into one feature subspace, the whole feature space is divided into D/2 subspaces, where D is the dimensionality of the original feature space. With 125 sub-Gaussian codewords per subspace, the whole acoustic model shrinks to 1/5 to 1/10 of its original size. The accuracy lost by this compression can be seen from the test results in the following table:
Table 4: Influence of the subspace-clustering-based model compression technique on system performance

  Model                   Continuous syllable recognition   Name recognition
  Diphone, uncompressed   80.24%                            96.90%
  Diphone, compressed     78.34%                            96.87%
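The 1/5 to 1/10 figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes an illustrative model of 10,000 Gaussians over a 38-dimensional feature space (these counts are assumptions, not figures from the patent); only the 125-codeword-per-subspace setting comes from the text above.

```python
# Memory estimate for subspace-quantized Gaussians. Model sizes here
# (10,000 Gaussians, 38-dim features) are illustrative assumptions.

BYTES_PER_FLOAT = 4
N_GAUSS = 10_000          # assumed Gaussian count of the acoustic model
D = 38                    # assumed (even) feature dimensionality
CODEBOOK_SIZE = 125       # sub-Gaussian codewords per subspace (from the text)
N_SUBSPACES = D // 2      # every two feature dimensions form one subspace

# Uncompressed: one D-dim mean and one D-dim diagonal variance per Gaussian.
original = N_GAUSS * 2 * D * BYTES_PER_FLOAT

# Compressed: one 1-byte codebook index per Gaussian per subspace, plus the
# shared codebooks themselves (a 2-dim mean and 2-dim variance per codeword).
indices = N_GAUSS * N_SUBSPACES * 1
codebooks = N_SUBSPACES * CODEBOOK_SIZE * (2 + 2) * BYTES_PER_FLOAT
compressed = indices + codebooks

print(f"compressed/original = {compressed / original:.3f}")
```

With these assumed sizes the ratio comes out near 1/13, i.e. at least as good as the 1/5 to 1/10 range the patent reports; the exact ratio depends on the Gaussian count and on how the indices are packed.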
From the above test figures it can be seen that, compared with the commonly used Triphone, the acoustic modeling method of the present invention shows only a small reduction in recognition rate for continuous syllable recognition, while the model size is greatly reduced. In addition, as the data cited in Tables 3 and 4 above show, the clustering and model compression techniques adopted by the present invention have an almost negligible influence on speech recognition accuracy, while the size of the acoustic model is only 1/5 to 1/10 of the original, greatly saving storage space. The method is therefore very well suited to devices with small memory, such as embedded devices.
The method described above can be realized by hardware or software. A program carrying out the method can be recorded on a computer-readable storage medium such as a floppy disk, a hard disk, a CD-ROM, or a DVD-ROM.
Although the invention has been described with reference to specific embodiments, the present invention is not limited thereto and is defined only by the appended claims; those skilled in the art can make various changes and improvements to the embodiments of the invention without departing from the spirit of the invention.

Claims (9)

1. A Chinese speech recognition modeling method, comprising the steps of:
(a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units;
(b) training the parameters of the model using a state clustering method, to obtain an initial hidden Markov model (HMM); and
(c) compressing the initial hidden Markov model using a subspace clustering algorithm, to produce a final model.
2. The Chinese speech recognition modeling method according to claim 1, wherein step (b) further comprises the steps of:
(b1) calculating the likelihood-probability loss caused by merging any two states;
(b2) finding, in the set of all possible state merges calculated in step (b1), the merge of the two state classes with the minimum likelihood-probability loss;
(b3) judging whether the sample counts of these two state classes are greater than a fixed threshold;
(b4) if it is judged in step (b3) that the sample counts are greater than the fixed threshold, deleting this merge from the above set of merges; if the sample count of at least one of the two state classes is less than the fixed threshold, merging the two state classes into a new state class, the feature space of the new state class being described by a two-mixture Gaussian mixture model; and
(b5) judging whether the sample count of every state class is greater than another fixed threshold; if greater than said other fixed threshold, adopting the K-Means clustering algorithm to estimate the parameters of the Gaussian mixture model of each merged state's output distribution; if the sample count of at least one state is not greater than said other fixed threshold, returning to step (b1).
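The bottom-up clustering of steps (b1)-(b5) repeatedly merges the pair of states whose merge costs the least likelihood. The following is a minimal sketch of that merge cost and of step (b2), assuming diagonal-covariance Gaussian state distributions (an assumption for illustration; the claim does not fix the covariance form):

```python
import numpy as np

def merge_loss(n1, mu1, var1, n2, mu2, var2):
    """Likelihood loss of pooling two diagonal-covariance Gaussian states.

    n1, n2 are sample counts; mu*, var* are per-dimension means/variances.
    """
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # Maximum-likelihood variance of the pooled state.
    var = (n1 * (var1 + (mu1 - mu) ** 2) + n2 * (var2 + (mu2 - mu) ** 2)) / n
    return 0.5 * (n * np.log(var).sum()
                  - n1 * np.log(var1).sum()
                  - n2 * np.log(var2).sum())

def best_merge(states):
    """Step (b2): the pair of state indices whose merge loses the least.

    Each entry of `states` is a (count, mean, variance) tuple.
    """
    pairs = [(i, j) for i in range(len(states)) for j in range(i + 1, len(states))]
    return min(pairs, key=lambda p: merge_loss(*states[p[0]], *states[p[1]]))
```

Merging two identical states costs nothing (loss 0), and the loss grows with the distance between the state distributions, which is what makes the greedy search of steps (b2)-(b5) sensible.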
3. The Chinese speech recognition modeling method according to claim 2, wherein the number of Gaussian mixtures of said Gaussian mixture model can be predefined as a fixed value, or determined dynamically.
4. The Chinese speech recognition modeling method according to claim 1, wherein step (c) adopts the LBG algorithm to cluster the standard model to generate an initial Gaussian codebook.
5. The Chinese speech recognition modeling method according to claim 4, wherein step (c) further comprises adopting the K-Means clustering algorithm to optimize the Gaussian codebook and generate a final Gaussian codebook.
6. The Chinese speech recognition modeling method according to claim 1, wherein step (c) further comprises the steps of:
extracting characteristic parameters from speech, and combining the feature dimensions with the strongest correlation to generate subspaces;
performing subspace decomposition on each Gaussian model in the Gaussian mixture model over the original feature space, to obtain sub-Gaussian model parameters;
clustering all the sub-Gaussian models obtained by the decomposition on each subspace, so as to generate a Gaussian codebook of a given size on each subspace; and
combining the resulting Gaussian codebooks to generate a final acoustic model.
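The decomposition and quantization steps of claim 6 can be sketched as follows, under two simplifying assumptions made only for this illustration: the covariances are diagonal, adjacent dimensions are paired (the patent pairs the most correlated dimensions), and the nearest codeword is chosen by plain Euclidean distance rather than a likelihood-based distortion.

```python
import numpy as np

def decompose(mean, var):
    """Slice a D-dim diagonal Gaussian into D/2 two-dimensional sub-Gaussians
    (adjacent dimensions paired, as a simplification)."""
    return [(mean[i:i + 2], var[i:i + 2]) for i in range(0, len(mean), 2)]

def quantize(sub_gaussians, codebooks):
    """Replace every sub-Gaussian by the index of its nearest codeword."""
    indices = []
    for (m, v), cb in zip(sub_gaussians, codebooks):
        # cb has shape (K, 4): [mean_0, mean_1, var_0, var_1] per codeword.
        target = np.concatenate([m, v])
        indices.append(int(((cb - target) ** 2).sum(axis=1).argmin()))
    return indices
```

After quantization the acoustic model stores only the per-Gaussian index lists plus the shared per-subspace codebooks, which is where the memory saving described in the test results comes from.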
7. The Chinese speech recognition modeling method according to claim 6, wherein the step of clustering the sub-Gaussian models further comprises the step of clustering all the original sub-Gaussian models on a subspace to generate an initial Gaussian codebook.
8. The Chinese speech recognition modeling method according to claim 6, wherein the step of clustering the sub-Gaussian models further comprises, after the initialization of the Gaussian codebook centers is completed, the step of performing a binary split so that each original Gaussian codeword is split into two new Gaussian codewords.
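The initialization of claims 7 and 8 follows the classic LBG procedure: start from a single centroid, binary-split every codeword, and refine each generation with K-Means passes (claim 5). A minimal sketch, again assuming plain Euclidean distortion between codewords for simplicity:

```python
import numpy as np

def lbg_codebook(points, size, iters=10, eps=1e-3):
    """Grow a codebook by binary splitting, refining with K-Means each time."""
    codebook = points.mean(axis=0, keepdims=True)  # single initial centroid
    while len(codebook) < size:
        # Binary split: every codeword becomes a slightly perturbed +/- pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Assign every point to its nearest codeword (Euclidean).
            d = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
            assign = d.argmin(axis=1)
            # Move each codeword to the mean of its assigned points.
            for k in range(len(codebook)):
                if (assign == k).any():
                    codebook[k] = points[assign == k].mean(axis=0)
    return codebook[:size]  # crude truncation for non-power-of-two sizes
```

A codebook size such as the 125 used in the description is not a power of two; the truncation above is a simplification, and a real implementation would instead prune the lowest-occupancy codewords after the last split.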
9. A computer-readable recording medium storing a program for carrying out a Chinese speech recognition modeling method, the method comprising the steps of:
(a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units;
(b) training the parameters of the model using a state clustering method, to obtain an initial hidden Markov model (HMM); and
(c) compressing the initial hidden Markov model using a subspace clustering algorithm, to produce a final model.
CN2004100041313A 2004-02-13 2004-02-13 Context-sensitive Chinese speech recognition modeling method Expired - Fee Related CN1655232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2004100041313A CN1655232B (en) 2004-02-13 2004-02-13 Context-sensitive Chinese speech recognition modeling method


Publications (2)

Publication Number Publication Date
CN1655232A true CN1655232A (en) 2005-08-17
CN1655232B CN1655232B (en) 2010-04-21

Family

ID=34891981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2004100041313A Expired - Fee Related CN1655232B (en) 2004-02-13 2004-02-13 Context-sensitive Chinese speech recognition modeling method

Country Status (1)

Country Link
CN (1) CN1655232B (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2296846A (en) * 1995-01-07 1996-07-10 Ibm Synthesising speech from text
US5680510A (en) * 1995-01-26 1997-10-21 Apple Computer, Inc. System and method for generating and using context dependent sub-syllable models to recognize a tonal language
US6064957A (en) * 1997-08-15 2000-05-16 General Electric Company Improving speech recognition through text-based linguistic post-processing
CN1141697C (en) * 2000-09-27 2004-03-10 中国科学院自动化研究所 Three-tone model with tune and training method
US6754626B2 (en) * 2001-03-01 2004-06-22 International Business Machines Corporation Creating a hierarchical tree of language models for a dialog system based on prompt and dialog context

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101292429B (en) * 2005-11-18 2012-04-04 英特尔公司 Method and device of compression using multiple markov chains
CN102693723A (en) * 2012-04-01 2012-09-26 北京安慧音通科技有限责任公司 Method and device for recognizing speaker-independent isolated word based on subspace
CN102982799A (en) * 2012-12-20 2013-03-20 中国科学院自动化研究所 Speech recognition optimization decoding method integrating guide probability
CN105810192A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Speech recognition method and system thereof
CN105810192B (en) * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 Audio recognition method and its system
CN105895089A (en) * 2015-12-30 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device
WO2017113739A1 (en) * 2015-12-30 2017-07-06 乐视控股(北京)有限公司 Voice recognition method and apparatus

Also Published As

Publication number Publication date
CN1655232B (en) 2010-04-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100421