CN1655232A - Context-sensitive Chinese speech recognition modeling method - Google Patents



Publication number
CN1655232A
Authority
CN
China
Prior art keywords
model
gauss
chinese
state
modeling method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2004100041313A
Other languages
Chinese (zh)
Other versions
CN1655232B (en)
Inventor
贾磊
马龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to CN2004100041313A priority Critical patent/CN1655232B/en
Publication of CN1655232A publication Critical patent/CN1655232A/en
Application granted granted Critical
Publication of CN1655232B publication Critical patent/CN1655232B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

This invention relates to a context-dependent Chinese speech recognition modeling method in which initials are right-context dependent and finals are left-context dependent. The method comprises: (a) creating context-dependent basic modeling units by associating each initial with the final immediately to its right and each final with the initial immediately to its left; (b) training the model parameters with a state-clustering method to obtain an initial hidden Markov model (HMM); and (c) compressing the HMM with a subspace clustering method to produce the final model.

Description

Context-dependent Chinese speech recognition modeling method
Technical field
The present invention relates to a speech recognition modeling method, and in particular to a context-dependent Chinese acoustic modeling method applicable to embedded devices.
Background art
Speech recognition is the technology of converting a speech signal into the corresponding text or command through processes of recognition and understanding. Combined with speech synthesis, it lets people dispense with the keyboard, operate machines by voice command, and converse with them by speech. Over the past two decades, with the rapid development of computer technology, speech recognition has made marked progress and begun to move from the laboratory to the market. It is expected that within the next ten years speech recognition technology will enter fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics.
At the present stage, however, as speech recognition moves toward practical application, there remains a tension between a computer's computing power and storage capacity on the one hand and the recognition accuracy of the system on the other. How to build high-precision acoustic models on memory-limited embedded devices is a key technical problem for the practical deployment of speech recognition systems.
Chinese patent publication CN1264468A discloses a computer-implemented dictation system that converts speech input into text. The system uses a text-to-phone structure to produce a spoken rendition of a given word and outputs that rendition through an audio device, so that the user of the speech recognition system knows how the recognizer expects the word to be pronounced.
Chinese patent publication CN1288225A discloses a speech recognition system and a speech-recognition-based control method. In this scheme, an operator's expected utterances are prerecorded and stored in a speech recognition table. When an unverified terminal electronic device is connected to the control device, the control device stores the speech recognition table supplied by that device; when the operator speaks, the control device compares the operator's voice against the stored speech recognition table and controls the input/output of the electronic device according to the comparison result.
The speech recognition system disclosed in CN1264468A adopts a context-dependent phoneme modeling method. Although the acoustic model built this way has high accuracy, it is relatively large, is difficult to load directly into the memory of an embedded device, and so cannot readily meet the practical needs of embedded applications.
The common problem with the above publications is that they require a large amount of memory and are unsuitable for use in embedded devices.
Summary of the invention
In view of the characteristics of Chinese, the present invention proposes a context-dependent phone modeling method based on state clustering, in which initials are right-context dependent and finals are left-context dependent. An acoustic model trained with this method combines high accuracy with a small footprint, making it particularly suitable for embedded devices with limited memory. To compress the acoustic model further while losing as little accuracy as possible, so as to preserve recognition performance on embedded devices, the invention compresses the acoustic model with a subspace clustering algorithm. With essentially no loss of recognition performance, this method can compress the acoustic model of a speech recognition system to 1/10 to 1/5 of its original size.
An object of the invention is to provide a context-dependent Chinese acoustic modeling method suitable for embedded devices. The method minimizes the likelihood loss over all training samples during initial model training and requires no speech corpus at all during model compression, so the model can be scaled down quickly and conveniently without a significant loss of accuracy.
The invention provides a context-dependent Chinese acoustic modeling method suitable for embedded devices, comprising the steps of: (a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units; (b) training the model parameters with a state-clustering method to obtain an initial hidden Markov model (HMM); and (c) compressing the initial HMM with a subspace clustering algorithm to produce the final model.
In another aspect, the invention provides a computer-readable recording medium storing a program that performs the context-dependent Chinese speech recognition modeling method: (a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units; (b) training the model parameters with a state-clustering method to obtain an initial hidden Markov model (HMM); and (c) compressing the initial HMM with a subspace clustering algorithm to produce the final model.
In short, targeting the pronunciation characteristics of Chinese, the invention proposes a diphone (semi-syllable) modeling method with right-dependent initials and left-dependent finals; it shares state output distributions and trains model parameters by state clustering, taking the minimal loss of likelihood over all training samples as the objective; and it compresses the acoustic model by subspace clustering. For this compression task, the LBG algorithm first clusters the original Gaussians to generate initial Gaussian codebooks, and a K-means clustering algorithm then refines the codebooks into their final form. A major advantage of the method is that no speech corpus is needed at any point during model compression.
Description of drawings
The features and advantages of the present invention will become clearer from the following description read with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of model training in Chinese acoustic modeling according to an embodiment of the invention;
Fig. 2 shows a commonly used hidden Markov model topology;
Fig. 3 is a flowchart of the state-clustering-based output-distribution sharing algorithm according to an embodiment of the invention;
Fig. 4 illustrates output-distribution sharing by hidden Markov state clustering; and
Fig. 5 is a flowchart of the subspace-clustering-based model compression according to the invention.
Embodiment
The basic principles of speech recognition are first described below.
Speech recognition involves two basic processes: training and recognition. The main task of training is to use a large number of speech samples to build an acoustic model that captures acoustic-level knowledge. A complex recognition system also uses a large text corpus to train a language model that captures language-level knowledge. During recognition, the acoustic model and language model obtained in training are used to decode the speech samples to be recognized into text. The technical innovations described in this patent concern mainly the acoustic model training stage.
As a language, Chinese has its own distinctive characteristics; exploiting them in acoustic modeling can maximize model performance while reducing model size.
Consider a comparison between Chinese and English, a typical Western language. The most striking difference is that Chinese is written ideographically while English is written alphabetically. The smallest linguistic unit of English is the word; words are continually coined and their number keeps changing, since new things and concepts usually call for new words. In pronunciation, an English word is formed by linking several syllables together, and the coupling between syllables is very strong. The smallest building block of Chinese, by contrast, is the character; characters, alone or combined into words, describe different things and concepts. The character is thus the basic and relatively independent building unit of Chinese, a notion that English lacks. In pronunciation, every Chinese character is an independent syllable, and each syllable consists of an initial followed by a final. The pronunciations of all Chinese characters form 408 distinct syllables. Because characters combine relatively independently in word formation, the syllables of Chinese speech are also relatively independent of one another. The present invention exploits precisely this inter-syllable independence of Chinese to build high-precision acoustic models.
Fig. 1 is an overview flowchart of the acoustic model training process. First, at step S11, the basic modeling units are selected and the contextual coupling between them is defined. Then, at step S12, the parameters of the hidden Markov models are trained on the speech training data using the state-clustering method, yielding the initial HMMs. Finally, at step S13, the initial models are compressed with the subspace clustering algorithm to obtain the final model.
Each step of the flowchart of Fig. 1 is described in detail below.
1. Select the basic modeling unit
Before acoustic model training begins, the basic modeling unit, i.e., the granularity of each model, must be defined. Speech recognition offers several choices of basic unit: phonemes, semi-syllables, syllables, or words. As noted above, each Chinese character is a syllable composed of an initial and a final, so most Chinese speech recognition systems take the initial/final pair as the basic modeling unit; this is called semi-syllable modeling. We likewise choose initials and finals as the basic modeling units: 27 initials, 38 finals, and one additional silence unit.
2. Define the context dependence between basic modeling units
"Context dependence" means that in continuous speech the pronunciation of each basic unit depends not only on the unit itself but also on its linguistic context, i.e., on the pronunciations of neighbouring units. For example, in the two words "China" (zh-ong g-uo) and "central" (zh-ong y-ang), the same basic unit (zh-)ong is pronounced differently because the following guo and yang are pronounced differently.
The characteristics of Chinese analysed above point to the relative independence between syllables in Chinese word formation. The invention uses exactly this inter-syllable independence to define the context dependence of the basic modeling units. We assume that the pronunciations of different syllables are mutually independent: the pronunciation of an initial depends only on the final to its right within the same syllable, and the pronunciation of a final depends only on the initial to its left within the same syllable.
This definition of context dependence matches the pronunciation characteristics of Chinese: it captures the coarticulation within a syllable quite accurately, and, because only single-sided context is considered, it greatly reduces the number of models. A simple calculation shows that if the contexts on both sides were considered (i.e., triphone modeling), the total number of models would be 27 × 39 × 39 + 38 × 28 × 28 + 1 = 70860, whereas with our dependence definition (diphone modeling) it drops to 27 × 38 + 38 × 27 + 1 = 2053. This sharp reduction relieves the pressure on the subsequent training and compression steps and makes the model far better suited to embedded systems.
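As a quick sanity check, the model counts above can be reproduced in a few lines of Python. This is an illustrative sketch; the factors 39 and 28 are assumed here to mean the 38 finals plus one boundary context and the 27 initials plus one boundary context, as the figures imply:

```python
# Context-dependent model counts for Mandarin, per the text:
# 27 initials, 38 finals, plus one silence model.
INITIALS, FINALS = 27, 38

# Triphone: each unit conditioned on both neighbours; the text uses
# 39 contexts per side for initials and 28 per side for finals.
triphone = INITIALS * 39 * 39 + FINALS * 28 * 28 + 1

# Diphone: an initial conditioned only on its right final,
# a final conditioned only on its left initial.
diphone = INITIALS * FINALS + FINALS * INITIALS + 1

print(triphone, diphone)  # 70860 2053
```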
3. Train the model parameters by state clustering, using likelihood loss as the distance criterion
The hidden Markov model (HMM) is the mainstream modeling approach in speech recognition. Fig. 2 shows the commonly used left-to-right HMM topology. The states are arranged from left to right; transitions follow the arrows, either self-loops or jumps to other states, each with a certain probability, and each state emits observations according to an attached probability density function (pdf).
Estimating the parameters of every Markov model directly from the available training data is difficult, and a model trained that way would have so many parameters that its memory footprint could not meet the needs of embedded devices. State clustering is therefore used to share output-distribution parameters among the states of different models, which both reduces the number of model parameters and ensures robust estimates for the shared output distributions.
The state clustering algorithm proceeds as follows.
First, the state output distributions are initialized. Each HMM is assumed to comprise three states, corresponding respectively to the initial, middle, and final segments of the speech observation samples of its basic modeling unit. The samples aligned to a given state form that state's feature space. Initially, each HMM state space is described by a two-component Gaussian mixture model; for states with fewer observation samples than a certain threshold, the feature space is described by a fixed-variance Gaussian model.
Next, the model state output distributions are shared. First define the context-dependent phone set (all-phone) of a given basic modeling unit: for a specific unit, all the HMMs produced by its different contexts are collectively called the context-dependent phone models of that unit. For the initial b, for instance, b-a, b-an, b-o, b-u, and so on are the context-dependent phone models of b. During state clustering, only states occupying the same position within the context-dependent phone models of the same unit are clustered together.
The state-clustering-based output-distribution sharing procedure of the invention is described below with reference to Fig. 3.
At step S31, the likelihood loss caused by merging any two states is computed. The likelihood loss is calculated with formula (1):
$$\mathrm{Dis}=\sum_{k\in C_1}\log P_1(o_k)+\sum_{k\in C_2}\log P_2(o_k)-\sum_{k\in C}\log P(o_k)\qquad(1)$$
As stated above, the feature space of each state is described by a two-component Gaussian mixture; P(o_k) denotes the observation probability density function on that feature space, and o_k is an input observation sample vector. C_1 and C_2 denote the two state classes before merging, and C denotes the state class generated by merging C_1 and C_2.
At step S32, among all the possible state merges computed at step S31, the merge of the two state classes with the minimum likelihood loss is selected. Then, at step S33, it is checked whether the sample counts of both state classes exceed a fixed threshold M. If so, the flow moves to step S34, this merge is removed from the candidate set, and the flow returns to step S32. If not, i.e., at least one of the two state classes has fewer samples than the threshold, the two classes are merged at step S35 into a new state class, whose feature space is again described by a two-component Gaussian mixture. Step S36 then checks whether every state class has more observation samples than another fixed threshold N. If so, the flow proceeds to step S37, where the K-means clustering algorithm estimates the parameters of the mixture output distribution of each merged state. Otherwise, i.e., at least one state class has no more than N samples, the flow returns to step S31 and continues computing the pairwise merge losses.
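The S31–S37 loop amounts to a greedy bottom-up merge driven by the likelihood loss of formula (1). The sketch below is a simplified, hypothetical illustration: it uses one-dimensional single-Gaussian states summarized by sufficient statistics (count, mean, variance), for which the total log-likelihood under the MLE fit has a closed form, and it omits the threshold tests M and N of steps S33 and S36; the patent itself uses two-component mixtures.

```python
import math

def gauss_loglik(n, mean, var):
    """Total log-likelihood of n samples under the 1-D Gaussian that is
    MLE-fitted to them: sum_k log N(x_k) = -n/2 * (log(2*pi*var) + 1)."""
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def merge_stats(s1, s2):
    """Pool two (count, mean, variance) summaries into one."""
    n1, m1, v1 = s1
    n2, m2, v2 = s2
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * (v1 + (m1 - m) ** 2) + n2 * (v2 + (m2 - m) ** 2)) / n
    return n, m, v

def merge_loss(s1, s2):
    """Formula (1): likelihood lost by modelling the pooled class C
    with one Gaussian instead of C1 and C2 separately (always >= 0)."""
    return (gauss_loglik(*s1) + gauss_loglik(*s2)
            - gauss_loglik(*merge_stats(s1, s2)))

# Greedy bottom-up pass: repeatedly merge the cheapest pair of states.
states = {"a": (100, 0.0, 1.0), "b": (100, 0.1, 1.0), "c": (100, 5.0, 1.0)}
while len(states) > 2:
    pairs = [(merge_loss(states[i], states[j]), i, j)
             for i in states for j in states if i < j]
    loss, i, j = min(pairs)
    states[i + j] = merge_stats(states.pop(i), states.pop(j))
print(sorted(states))  # the two close states merge first: ['ab', 'c']
```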
Fig. 4 illustrates merged state classes whose feature spaces share the same output distribution. In the state clustering process above, the number of Gaussian components in each mixture can either be fixed in advance or be determined dynamically by some criterion (for example, the BIC criterion).
4. Compress the acoustic model by subspace clustering
The acoustic model trained by state clustering is usually still fairly large and remains difficult to use directly in an embedded system. The invention therefore also includes a model compression technique based on subspace clustering, which can compress the acoustic model to 1/10 to 1/5 of its original size with essentially no loss of recognition performance. The technique is described in detail below with reference to Fig. 5.
First, the subspaces are defined at step S51. In speech recognition, both recognition and training extract characteristic parameters, normally multidimensional vectors, from speech; all possible values of these feature vectors constitute the original multidimensional feature space. A "subspace" is the feature space generated by grouping together the feature dimensions that are most correlated. Subspaces may be delimited manually by rule of thumb, or the correlation coefficients between dimensions may be computed on the original feature space and the most correlated dimensions grouped into a subspace. Defining the subspaces divides the original feature space into several feature subspaces.
After the subspaces are defined, the original Gaussians are decomposed at step S52. Each Gaussian of a mixture built on the original feature space can be expressed as the product of sub-Gaussians built on the subspaces:
$$P(O)=\sum_{m=1}^{M}c_m\prod_{k=1}^{K}N_{\mathrm{Tied}}(O_k;\mu_{mk},\sigma_{mk})\qquad(2)$$
Formula (2) gives the mixture score P(O) for an observation vector O of the original feature space after the space has been decomposed into subspaces. Here M is the number of mixture components, K is the number of subspaces, N_Tied(O_k; μ_mk, σ_mk) is the sub-Gaussian built on the k-th subspace, O_k is the sub-vector of the observation O corresponding to the k-th subspace, and {μ_mk, σ_mk} are the parameters of the original Gaussian restricted to the k-th subspace.
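Under the decomposition of formula (2), a model score can be assembled from per-subspace codebooks of tied sub-Gaussians. The sketch below is an illustrative Python rendering with diagonal covariances and invented data structures (codebook lists plus per-mixture index tables), not the patent's implementation:

```python
import math

def diag_gauss(x, mu, var):
    """Density of a diagonal-covariance Gaussian at x."""
    e = sum((xi - mi) ** 2 / vi + math.log(2 * math.pi * vi)
            for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * e)

def score(o, weights, codebooks, indices, subspaces):
    """Formula (2): P(O) = sum_m c_m * prod_k N_Tied(O_k; mu_mk, sigma_mk).
    Each mixture component m stores, per subspace k, only an index into
    that subspace's shared codebook of sub-Gaussians."""
    total = 0.0
    for m, c_m in enumerate(weights):
        p = 1.0
        for k, dims in enumerate(subspaces):
            o_k = [o[d] for d in dims]              # observation sub-vector O_k
            mu, var = codebooks[k][indices[m][k]]   # tied sub-Gaussian params
            p *= diag_gauss(o_k, mu, var)
        total += c_m * p
    return total

# Sanity check: with codebook entries equal to the full Gaussian's
# marginals, the subspace product recovers the full diagonal density.
o = [0.5, -0.5, 1.0, 0.0]
cb = [[([0.0, 0.0], [1.0, 1.0])], [([0.0, 0.0], [1.0, 1.0])]]
p_sub = score(o, [1.0], cb, [[0, 0]], [(0, 1), (2, 3)])
p_full = diag_gauss(o, [0.0] * 4, [1.0] * 4)
print(abs(p_sub - p_full) < 1e-12)  # True
```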
The flow then proceeds to step S53: on each subspace, all the sub-Gaussians obtained by the decomposition are clustered. Clustering produces a number of Gaussian codebooks on each subspace, and the recognizer's final acoustic model is assembled from these codebooks. The clustering of the sub-Gaussians consists of two stages: initializing the Gaussian codebooks and then refining them.
At step S53, the Gaussian codebooks are initialized. The LBG algorithm clusters all the original sub-Gaussians on a subspace to generate an initial codebook. Concretely, a codebook centre is first created: its mean is estimated directly as the centre of the means of all the sub-Gaussians, and its variance as the average of their variances. After the codebook centre is initialized, a cycle of "binary split - cluster - codebook update" is repeated. In the binary split, each existing codeword is perturbed into two new codewords. All training samples are then re-clustered against the split codebook, and the centre of each class is computed to give the updated codebook. The split-cluster-update cycle repeats until the codebook reaches its predetermined size. During clustering, when a sub-Gaussian is reassigned, the following formula is used as the distance criterion, measuring the distance from a sub-Gaussian to a codebook centre; each sub-Gaussian is assigned to the class represented by its nearest Gaussian centre.
$$D_{\mathrm{bhat}}=\frac{1}{8}(\mu_2-\mu_1)^T\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mu_2-\mu_1)+\frac{1}{2}\ln\frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}\qquad(3)$$
Formula (3) gives the distance between the Gaussians N(μ_1, Σ_1) and N(μ_2, Σ_2).
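For diagonal covariances, formula (3) reduces to a sum of per-dimension terms, since the determinants and the matrix inverse factor across dimensions. A minimal illustrative version (not the patent's code):

```python
import math

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance of formula (3) between two
    diagonal-covariance Gaussians, computed dimension by dimension."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                      # averaged variance
        d += 0.125 * (m2 - m1) ** 2 / v          # mean-separation term
        d += 0.5 * math.log(v / math.sqrt(v1 * v2))  # variance-mismatch term
    return d

# Identical Gaussians are at distance 0; distance grows with the mean gap.
print(bhattacharyya_diag([0.0], [1.0], [0.0], [1.0]))  # 0.0
print(bhattacharyya_diag([0.0], [1.0], [2.0], [1.0]))  # 0.5
```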
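With a distance criterion in hand, the split-cluster-update cycle of the LBG initialization at step S53 can be sketched. The illustration below is a deliberately simplified one-dimensional version: it clusters scalar points with absolute-value distance and an additive perturbation, standing in for the patent's clustering of sub-Gaussian models under distance (3).

```python
import random

def lbg(points, target, eps=0.5, iters=10):
    """LBG codebook initialization: start from the global mean, then
    repeat binary split (perturbation) -> re-cluster -> update centres
    until the codebook holds `target` codewords (a power of 2 here)."""
    code = [sum(points) / len(points)]                    # global centre
    while len(code) < target:
        code = [c + s for c in code for s in (eps, -eps)]  # binary split
        for _ in range(iters):                             # Lloyd refinement
            cells = [[] for _ in code]
            for p in points:
                i = min(range(len(code)), key=lambda i: abs(p - code[i]))
                cells[i].append(p)
            # Update each codeword to its cell mean; keep it if the cell is empty.
            code = [sum(v) / len(v) if v else c for v, c in zip(cells, code)]
    return sorted(code)

random.seed(0)
data = [random.gauss(-4, 0.1) for _ in range(50)] + \
       [random.gauss(4, 0.1) for _ in range(50)]
codebook = lbg(data, 2)
print(codebook)  # two codewords, near -4 and +4
```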
After the Gaussian codebooks are initialized, they are further refined at step S54: a generalized K-means clustering algorithm re-optimizes the codebook of each subspace. During optimization the sub-Gaussians are repartitioned according to formula (3), and the Gaussian codebook centres are updated according to formulas (4) and (5):
$$\mu=\frac{n_1\mu_1+n_2\mu_2}{n_1+n_2}\qquad(4)$$
$$\Sigma=\frac{n_1(\mu_1-\mu)(\mu_1-\mu)^T+n_2(\mu_2-\mu)(\mu_2-\mu)^T}{n_1+n_2}+\frac{n_1\Sigma_1+n_2\Sigma_2}{n_1+n_2}\qquad(5)$$
Here {μ, Σ} are the mean and variance parameters of the updated Gaussian, {μ_1, Σ_1} and {μ_2, Σ_2} are the parameters of the two sub-Gaussians participating in the cluster, and n_1 and n_2 are the sample counts used when those two Gaussian distributions were estimated.
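Formulas (4) and (5) amount to pooling two Gaussians by their sample counts: the merged mean is the count-weighted mean, and the merged variance is the within-class average plus the between-class spread. An illustrative per-dimension (diagonal-covariance) version:

```python
def merge_gaussians(n1, mu1, var1, n2, mu2, var2):
    """Codebook-centre update of formulas (4) and (5), per dimension."""
    n = n1 + n2
    # Formula (4): count-weighted mean.
    mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]
    # Formula (5): between-class term plus pooled within-class variance.
    var = [(n1 * ((a - m) ** 2 + va) + n2 * ((b - m) ** 2 + vb)) / n
           for a, b, m, va, vb in zip(mu1, mu2, mu, var1, var2)]
    return n, mu, var

# Merging two unit-variance Gaussians centred at 0 and 2 (equal counts)
# gives mean 1 and variance 1 + 1 = 2.
print(merge_gaussians(10, [0.0], [1.0], 10, [2.0], [1.0]))
# (20, [1.0], [2.0])
```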
It should be noted that no raw speech training samples are involved at any stage of model compression; compared with earlier subspace model compression methods, the method proposed here is therefore more flexible and convenient to use.
The invention is intended chiefly for speech recognition systems on embedded devices. Its main benefit is that it preserves the recognition performance of the original system while shrinking the acoustic model, making a high-precision speech recognition system feasible even on embedded devices with very little memory.
The performance of the model training method of the invention is compared below with that of common modeling algorithms, to demonstrate the effectiveness of the invention.
1) Continuous-speech syllable recognition
The main purpose of this test is to compare the accuracy of the model trained with the acoustic modeling method of this patent against a model trained with a common acoustic modeling method.
The training data are recordings of 83 male speakers from the 863 training set. The test data are the standard male-voice test utterances of 240 words from the 863 test set. The acoustic models compared are the diphone model trained with the method of the invention and a triphone model trained with the common decision-tree-based method. The results are as follows:
Table 1. Model accuracy of the diphone and triphone models in continuous syllable recognition

  Model                       Continuous-syllable recognition rate
  Diphone (this invention)    80.24%
  Common triphone             82.02%
2) Name recognition on a mobile phone
The main purpose of this test is to measure, on a name recognition task, the recognition performance of acoustic models trained with the proposed training algorithm. The models compared are the compressed diphone model trained with the method of this patent and a triphone model trained with the common decision-tree-based method. The training data are still the recordings of 83 male speakers from the 863 training set; the test data are 2500 names recorded by 10 speakers in a laboratory environment.
Table 2. Model accuracy of the diphone and triphone models in name recognition

  Model                       Name recognition rate
  Diphone (this invention)    96.90%
  Common triphone             97.00%
3) Model size comparison
Because memory is very scarce in embedded devices, model size is, alongside raw accuracy, an important part of any overall measure of system performance.
When producing the results of Table 2, the memory occupied by the diphone model of the invention was 1/10 that of the commonly used triphone model.
4) Effect of state clustering in building the diphone model
When building diphone models, the state-clustering-based diphone model achieves higher accuracy than the decision-tree-based one. Table 3 gives the continuous-syllable recognition rates of the two diphone systems:
Table 3. Model accuracy of the state-clustering-based and decision-tree-based diphones

  Model                               Continuous-syllable recognition rate
  Diphone based on state clustering   80.24%
  Diphone based on decision tree      79.01%
5) Memory reduction contributed by the subspace-clustering-based model compression
If every two feature dimensions are grouped into one feature subspace, the whole feature space is divided into D/2 subspaces, where D is the dimensionality of the original feature space. With 125 sub-Gaussian codewords per subspace, the whole acoustic model shrinks to 1/5 to 1/10 of its original size. The accuracy lost by this compression can be seen from the test results in the following table:
Table 4: Influence of the subspace-clustering-based model compression technique on system performance

  Model                   Continuous syllable recognition   Name recognition
  Diphone, uncompressed   80.24%                            96.90%
  Diphone, compressed     78.34%                            96.87%
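The 1/5 to 1/10 figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes an illustrative model of 10,000 Gaussians over a 38-dimensional feature space (these counts are assumptions, not figures from the patent); only the 125-codeword-per-subspace setting comes from the text above.

```python
# Memory estimate for subspace-quantized Gaussians. Model sizes here
# (10,000 Gaussians, 38-dim features) are illustrative assumptions.

BYTES_PER_FLOAT = 4
N_GAUSS = 10_000          # assumed Gaussian count of the acoustic model
D = 38                    # assumed (even) feature dimensionality
CODEBOOK_SIZE = 125       # sub-Gaussian codewords per subspace (from the text)
N_SUBSPACES = D // 2      # every two feature dimensions form one subspace

# Uncompressed: one D-dim mean and one D-dim diagonal variance per Gaussian.
original = N_GAUSS * 2 * D * BYTES_PER_FLOAT

# Compressed: one 1-byte codebook index per Gaussian per subspace, plus the
# shared codebooks themselves (a 2-dim mean and 2-dim variance per codeword).
indices = N_GAUSS * N_SUBSPACES * 1
codebooks = N_SUBSPACES * CODEBOOK_SIZE * (2 + 2) * BYTES_PER_FLOAT
compressed = indices + codebooks

print(f"compressed/original = {compressed / original:.3f}")
```

With these assumed sizes the ratio comes out near 1/13, i.e. at least as good as the 1/5 to 1/10 range the patent reports; the exact ratio depends on the Gaussian count and on how the indices are packed.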
From the above test figures it can be seen that, compared with the commonly used Triphone, the acoustic modeling method of the present invention shows only a small reduction in recognition rate for continuous syllable recognition, while the model size is greatly reduced. In addition, as the data cited in Tables 3 and 4 above show, the clustering and model compression techniques adopted by the present invention have an almost negligible influence on speech recognition accuracy, while the size of the acoustic model is only 1/5 to 1/10 of the original, greatly saving storage space. The method is therefore very well suited to devices with small memory, such as embedded devices.
The method described above can be realized by hardware or software. A program carrying out the method can be recorded on a computer-readable storage medium such as a floppy disk, a hard disk, a CD-ROM, or a DVD-ROM.
Although the invention has been described with reference to specific embodiments, the present invention is not limited thereto and is defined only by the appended claims; those skilled in the art can make various changes and improvements to the embodiments of the invention without departing from the spirit of the invention.

Claims (9)

1. A Chinese speech recognition modeling method, comprising the steps of:
(a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units;
(b) training the parameters of the model using a state clustering method, to obtain an initial hidden Markov model (HMM); and
(c) compressing the initial hidden Markov model using a subspace clustering algorithm, to produce a final model.
2. The Chinese speech recognition modeling method according to claim 1, wherein step (b) further comprises the steps of:
(b1) calculating the likelihood-probability loss caused by merging any two states;
(b2) finding, in the set of all possible state merges calculated in step (b1), the merge of the two state classes with the minimum likelihood-probability loss;
(b3) judging whether the sample counts of these two state classes are greater than a fixed threshold;
(b4) if it is judged in step (b3) that the sample counts are greater than the fixed threshold, deleting this merge from the above set of merges; if the sample count of at least one of the two state classes is less than the fixed threshold, merging the two state classes into a new state class, the feature space of the new state class being described by a two-mixture Gaussian mixture model; and
(b5) judging whether the sample count of every state class is greater than another fixed threshold; if greater than said other fixed threshold, adopting the K-Means clustering algorithm to estimate the parameters of the Gaussian mixture model of each merged state's output distribution; if the sample count of at least one state is not greater than said other fixed threshold, returning to step (b1).
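The bottom-up clustering of steps (b1)-(b5) repeatedly merges the pair of states whose merge costs the least likelihood. The following is a minimal sketch of that merge cost and of step (b2), assuming diagonal-covariance Gaussian state distributions (an assumption for illustration; the claim does not fix the covariance form):

```python
import numpy as np

def merge_loss(n1, mu1, var1, n2, mu2, var2):
    """Likelihood loss of pooling two diagonal-covariance Gaussian states.

    n1, n2 are sample counts; mu*, var* are per-dimension means/variances.
    """
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # Maximum-likelihood variance of the pooled state.
    var = (n1 * (var1 + (mu1 - mu) ** 2) + n2 * (var2 + (mu2 - mu) ** 2)) / n
    return 0.5 * (n * np.log(var).sum()
                  - n1 * np.log(var1).sum()
                  - n2 * np.log(var2).sum())

def best_merge(states):
    """Step (b2): the pair of state indices whose merge loses the least.

    Each entry of `states` is a (count, mean, variance) tuple.
    """
    pairs = [(i, j) for i in range(len(states)) for j in range(i + 1, len(states))]
    return min(pairs, key=lambda p: merge_loss(*states[p[0]], *states[p[1]]))
```

Merging two identical states costs nothing (loss 0), and the loss grows with the distance between the state distributions, which is what makes the greedy search of steps (b2)-(b5) sensible.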
3. The Chinese speech recognition modeling method according to claim 2, wherein the number of Gaussian mixtures of said Gaussian mixture model can be predefined as a fixed value, or determined dynamically.
4. The Chinese speech recognition modeling method according to claim 1, wherein step (c) adopts the LBG algorithm to cluster the standard model to generate an initial Gaussian codebook.
5. The Chinese speech recognition modeling method according to claim 4, wherein step (c) further comprises adopting the K-Means clustering algorithm to optimize the Gaussian codebook and generate a final Gaussian codebook.
6. The Chinese speech recognition modeling method according to claim 1, wherein step (c) further comprises the steps of:
extracting characteristic parameters from speech, and combining the feature dimensions with the strongest correlation to generate subspaces;
performing subspace decomposition on each Gaussian model in the Gaussian mixture model over the original feature space, to obtain sub-Gaussian model parameters;
clustering all the sub-Gaussian models obtained by the decomposition on each subspace, so as to generate a Gaussian codebook of a given size on each subspace; and
combining the resulting Gaussian codebooks to generate a final acoustic model.
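The decomposition and quantization steps of claim 6 can be sketched as follows, under two simplifying assumptions made only for this illustration: the covariances are diagonal, adjacent dimensions are paired (the patent pairs the most correlated dimensions), and the nearest codeword is chosen by plain Euclidean distance rather than a likelihood-based distortion.

```python
import numpy as np

def decompose(mean, var):
    """Slice a D-dim diagonal Gaussian into D/2 two-dimensional sub-Gaussians
    (adjacent dimensions paired, as a simplification)."""
    return [(mean[i:i + 2], var[i:i + 2]) for i in range(0, len(mean), 2)]

def quantize(sub_gaussians, codebooks):
    """Replace every sub-Gaussian by the index of its nearest codeword."""
    indices = []
    for (m, v), cb in zip(sub_gaussians, codebooks):
        # cb has shape (K, 4): [mean_0, mean_1, var_0, var_1] per codeword.
        target = np.concatenate([m, v])
        indices.append(int(((cb - target) ** 2).sum(axis=1).argmin()))
    return indices
```

After quantization the acoustic model stores only the per-Gaussian index lists plus the shared per-subspace codebooks, which is where the memory saving described in the test results comes from.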
7. The Chinese speech recognition modeling method according to claim 6, wherein the step of clustering the sub-Gaussian models further comprises the step of clustering all the original sub-Gaussian models on a subspace to generate an initial Gaussian codebook.
8. The Chinese speech recognition modeling method according to claim 6, wherein the step of clustering the sub-Gaussian models further comprises, after the initialization of the Gaussian codebook centers is completed, the step of performing a binary split so that each original Gaussian codeword is split into two new Gaussian codewords.
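The initialization of claims 7 and 8 follows the classic LBG procedure: start from a single centroid, binary-split every codeword, and refine each generation with K-Means passes (claim 5). A minimal sketch, again assuming plain Euclidean distortion between codewords for simplicity:

```python
import numpy as np

def lbg_codebook(points, size, iters=10, eps=1e-3):
    """Grow a codebook by binary splitting, refining with K-Means each time."""
    codebook = points.mean(axis=0, keepdims=True)  # single initial centroid
    while len(codebook) < size:
        # Binary split: every codeword becomes a slightly perturbed +/- pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Assign every point to its nearest codeword (Euclidean).
            d = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
            assign = d.argmin(axis=1)
            # Move each codeword to the mean of its assigned points.
            for k in range(len(codebook)):
                if (assign == k).any():
                    codebook[k] = points[assign == k].mean(axis=0)
    return codebook[:size]  # crude truncation for non-power-of-two sizes
```

A codebook size such as the 125 used in the description is not a power of two; the truncation above is a simplification, and a real implementation would instead prune the lowest-occupancy codewords after the last split.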
9. A computer-readable recording medium storing a program for carrying out a Chinese speech recognition modeling method, the method comprising the steps of:
(a) associating each initial of Chinese speech with the final immediately to its right, and each final with the initial immediately to its left, to create context-dependent basic modeling units;
(b) training the parameters of the model using a state clustering method, to obtain an initial hidden Markov model (HMM); and
(c) compressing the initial hidden Markov model using a subspace clustering algorithm, to produce a final model.
CN2004100041313A 2004-02-13 2004-02-13 Context-sensitive Chinese speech recognition modeling method Expired - Fee Related CN1655232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2004100041313A CN1655232B (en) 2004-02-13 2004-02-13 Context-sensitive Chinese speech recognition modeling method


Publications (2)

Publication Number Publication Date
CN1655232A true CN1655232A (en) 2005-08-17
CN1655232B CN1655232B (en) 2010-04-21

Family

ID=34891981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2004100041313A Expired - Fee Related CN1655232B (en) 2004-02-13 2004-02-13 Context-sensitive Chinese speech recognition modeling method

Country Status (1)

Country Link
CN (1) CN1655232B (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2296846A (en) * 1995-01-07 1996-07-10 Ibm Synthesising speech from text
US5680510A (en) * 1995-01-26 1997-10-21 Apple Computer, Inc. System and method for generating and using context dependent sub-syllable models to recognize a tonal language
US6064957A (en) * 1997-08-15 2000-05-16 General Electric Company Improving speech recognition through text-based linguistic post-processing
CN1141697C (en) * 2000-09-27 2004-03-10 中国科学院自动化研究所 Three-tone model with tune and training method
US6754626B2 (en) * 2001-03-01 2004-06-22 International Business Machines Corporation Creating a hierarchical tree of language models for a dialog system based on prompt and dialog context

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101292429B (en) * 2005-11-18 2012-04-04 英特尔公司 Method and device of compression using multiple markov chains
CN102693723A (en) * 2012-04-01 2012-09-26 北京安慧音通科技有限责任公司 Method and device for recognizing speaker-independent isolated word based on subspace
CN102982799A (en) * 2012-12-20 2013-03-20 中国科学院自动化研究所 Speech recognition optimization decoding method integrating guide probability
CN105810192A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Speech recognition method and system thereof
CN105810192B (en) * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 Audio recognition method and its system
CN105895089A (en) * 2015-12-30 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device
WO2017113739A1 (en) * 2015-12-30 2017-07-06 乐视控股(北京)有限公司 Voice recognition method and apparatus

Also Published As

Publication number Publication date
CN1655232B (en) 2010-04-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100421