CN1161703C

CN1161703C - Integrated prediction searching method for Chinese continuous speech recognition

Info

Publication number: CN1161703C
Application number: CNB001249711A
Authority: CN
Inventors: 徐波; 黄泰翼
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2000-09-27
Filing date: 2000-09-27
Publication date: 2004-08-11
Anticipated expiration: 2020-09-27
Also published as: CN1346112A

Abstract

The present invention relates to an integrated prediction search method for Chinese continuous speech recognition and belongs to the field of automatic speech recognition. Its basic characteristic is that a tonal triphone acoustic model and a word trigram statistical language model are searched together in a single integrated pass, and the language model is predicted during decoding. The invention addresses the integrated search method, the organization of the dictionary, the retrieval of the predicted language model probabilities, and the pruning of local search paths.

Description

Integrated prediction search method for Chinese continuous speech recognition

Technical field

The invention belongs to the technical field of Chinese continuous speech recognition and relates in particular to an integrated prediction search method for Chinese continuous speech recognition. It is also applicable to any pattern recognition or artificial intelligence search problem that requires the integration of multiple knowledge sources.

Background technology

At present, the more successful approaches to speech recognition are based on statistical models. Their basic characteristic is a set of adjustable parameters that can be estimated directly from observed data. Let A denote the acoustic observations that the recognizer must decode, let W denote a possible word sequence, and let P(W|A) denote the probability that the word sequence W was spoken given the observation A. By statistical decision theory, the recognizer should decide according to the following formula:

\tilde{W} = \underset{W}{\arg \max} P (W | A)

(formula 1)

By Bayes' rule, and because P(A) does not depend on W, formula 1 can be rewritten as:

\tilde{W} = \underset{W}{\arg \max} P (A | W) P (W)

(formula 2)

Here P(W) is the probability that the word string W is spoken, and P(A|W) is the probability of observing the data A given that the word string W was spoken. The recognition system is illustrated in Fig. 1: the recognizer comprises front-end processing, an acoustic model P(A|W), a language model P(W), and a search algorithm. The search algorithm finds the word sequence with the maximum probability given the acoustic model, the language model, and the acoustic feature sequence.
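The decision rule of formula 2 can be sketched in a few lines of Python. This is only an illustration: the candidate word strings and their scores are hypothetical, and the function name `decode` is not taken from the patent. Scores are kept as log-probabilities so that the product P(A|W) P(W) becomes a sum.

```python
import math


def decode(candidates):
    """Return the word string W that maximizes log P(A|W) + log P(W).

    `candidates` maps each word string to a pair
    (acoustic log-probability, language model log-probability).
    """
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])


# Hypothetical scores for three candidate word strings.
candidates = {
    "w1 w2": (math.log(0.020), math.log(0.10)),
    "w1 w3": (math.log(0.030), math.log(0.05)),
    "w4 w2": (math.log(0.010), math.log(0.30)),
}
best = decode(candidates)
```

In this toy example the candidate with the weakest acoustic score still wins because its language model probability is much higher, which is the trade-off the search has to balance.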

The basic search algorithms are mainly the time-synchronous Viterbi beam search and the depth-first A* search. After many years of research aimed at reducing the enormous computational cost of search, continuous speech recognition search frameworks typified by multi-pass search have appeared. The basic idea of these frameworks is to add higher-level acoustic model knowledge and language model knowledge step by step, using the results of the previous pass as a heuristic to speed up the next pass. Multi-pass frameworks fall into two classes according to their intermediate output: the first class directly outputs the N sentences with the highest probability scores (N-best), while the second class produces an intermediate word graph that serves as the grammar for the next pass. In fact, the N-best sentences can themselves be generated from a word graph, and the word graph can be regarded as an intermediate result of the N-best algorithm; their relationship is shown in Fig. 2.

Among the basic search algorithms, time-frame-synchronous search is widely adopted in speech recognition. It is essentially a dynamic programming technique based on Viterbi beam search and can be viewed as advancing through time on the grid of model states. Suppose the frame-synchronous search has processed the input up to time t, so that the corresponding partial observation sequence is Y_1 Y_2 ... Y_t, and the partial path W_l is currently in unit model λ_t and in state s_t inside that unit. The score of the partial path can then be defined as:

\Pr (W_l) = \prod_{i = 1}^{t} \mathrm{Prob} (Y_i | \lambda_t, s_t, W_l)

(formula 3)

In formula 3, the states that can be expanded at time t+1 are constrained by the grammar: inside an HMM, s_t is constrained by the HMM topology; the expansion of λ_t is constrained by the dictionary; and the expansion of W_l is constrained by the language model between words. The most basic constraint in Chinese continuous speech recognition is the dictionary, since both the search space and the language model that relates word to word depend on it. The dictionary is generally organized as a tree, as shown in Fig. 4. In this figure, each route from the root node to a leaf node represents one path, and the leaf node of the path corresponds to a group of homophones (such as "pressing" and "secretly" in Fig. 4). With this representation, words can fully share their common initial portions in the dictionary, which reduces the number of search paths and improves search efficiency.
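The tree organization of the dictionary can be illustrated with a small prefix tree. The sketch below is an illustration under assumptions, not the patent's data structure: the class `Node`, the function names, and the four lexicon entries (placeholder word labels with pinyin-like units) are all hypothetical. It shows that words with a common beginning share nodes and that homophones end at the same node.

```python
class Node:
    def __init__(self):
        self.children = {}   # sub-word unit -> child Node
        self.words = []      # homophones whose pronunciation ends at this node


def build_tree(lexicon):
    """Insert every (word, units) pair into a prefix tree and return its root."""
    root = Node()
    for word, units in lexicon:
        node = root
        for unit in units:
            node = node.children.setdefault(unit, Node())
        node.words.append(word)
    return root


def count_nodes(node):
    """Number of nodes in the subtree, including the node itself."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Hypothetical lexicon: word_A and word_B are homophones.
LEXICON = [
    ("word_A", ("zh", "ong1")),
    ("word_B", ("zh", "ong1")),
    ("word_C", ("zh", "ang1")),
    ("word_D", ("b", "a1")),
]
root = build_tree(LEXICON)
homophones = root.children["zh"].children["ong1"].words
total_nodes = count_nodes(root)
```

Stored separately, the four pronunciations would need eight unit nodes; the tree needs five plus the root, because the shared initial "zh" and the shared pronunciation of the two homophones are stored only once.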

Suppose the dictionary is organized as shown in Fig. 4, and a local path i in the search is represented by Path(i) = {W_1, W_2, n, s, ...}, where W_1 and W_2 are the two preceding history words of the path, n is the number of the tree node the path is currently in, and s is the HMM state the path occupies within node n. When s is the last state of the HMM, the path jumps from node n to one of the successor nodes of n, as shown in Fig. 7. Let m be one of the successor nodes of n; expanding from n to m turns path i into path j. Because the path has not yet reached a leaf node after jumping from n to m, the identity of its word is still undetermined, so its language probability is unchanged. The transition also consumes no acoustic time, so its acoustic score is unchanged as well. The overall score therefore does not change, i.e. Prob(Path_j) = Prob(Path_i). With this tree-based expansion, the search can determine the word number only when it reaches a leaf node of the tree, so linguistic knowledge is added with a long delay, which can cause unrecoverable errors. In addition, because the scores of different paths differ only slightly, and some are even identical, pruning becomes difficult.

Moreover, most paths in the search have very low scores. Keeping these low-scoring paths is impractical in both space and time, and it is also unnecessary. The paths can therefore be pruned dynamically during the frame-synchronous search, discarding those with little prospect. Among the N expanded paths, the current best path is:

W_M = \underset{l \le N}{\arg \max} \Pr (W_l)

(formula 4)

A threshold BEAM can be set so that all paths whose scores lie between Pr(W_M) and BEAM*Pr(W_M) are kept for the next expansion step and all other paths are deleted. Beam search thus greatly reduces the search effort (to roughly the order of the beam width times the length of the input observation vector sequence). Traditional pruning generally uses a single-threshold strategy: if the probability P of a path satisfies P ≤ BEAM*Pr(W_M), the local path is pruned. However, the dynamic scores of paths change constantly during the search. Pruning too much introduces many search errors, while pruning too little slows recognition. The most direct remedy is to control the number of paths, but this requires operations such as sorting all paths, which adds computation.
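A minimal sketch of the single-threshold strategy described above, assuming path scores are plain probabilities held in a dictionary; the function name `beam_prune` and the example scores are hypothetical.

```python
def beam_prune(paths, beam):
    """Keep only the paths whose score exceeds beam * (best score).

    `paths` maps a path identifier to its probability score and
    `beam` is a factor in (0, 1); a path with P <= beam * best is pruned.
    """
    best = max(paths.values())
    return {path: score for path, score in paths.items() if score > beam * best}


# Hypothetical active paths at one time frame.
paths = {"a": 0.5, "b": 0.26, "c": 0.05, "d": 0.25}
kept = beam_prune(paths, beam=0.5)
```

The number of surviving paths depends entirely on how the scores happen to be distributed at each frame, which is the weakness the text attributes to a single threshold.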

Summary of the invention

The object of the invention is to make full use of a powerful tonal triphone acoustic model and a word trigram language model to find an optimal result in a single search, overcoming problems common to multi-pass search frameworks: 1) multi-pass search cannot bring all knowledge sources together for decoding, so the algorithm is not optimal and errors propagate and grow; 2) multi-pass search uses relatively simple acoustic and language models in the early passes, which makes errors more likely.

A further object of the invention is to design the organization of the dictionary so that language prediction becomes possible, and the language probability can be added without waiting until a leaf node is reached, which speeds up the search. The integrated prediction search method for Chinese continuous speech recognition according to the invention is characterized in that a tonal triphone model and a word trigram statistical language model are searched together in a single integrated pass, and the language model is predicted during decoding; the core search algorithm is a time-frame-synchronous beam search with multi-threshold pruning, and during the search the special structure of the dictionary and the trigram statistical model are used to predict the language model;

The search dictionary is organized as a tree and has the following structural features:

1) the words in the dictionary are numbered so that the word numbers follow the order of the words at the leaf nodes of the tree;

2) in the tree-organized dictionary, each node representing a model holds two word numbers W_x and W_y, which give the range of words that can be reached by expansion from this node, i.e. every word reachable from this node has a number between W_x and W_y;

3) if node m is obtained by expanding node n, then necessarily:

W_{nx} \le W_{mx} \le W_{my} \le W_{ny}

where W_mx and W_my bound the range of words reachable from node m, and W_nx and W_ny bound the range of words reachable from node n;

The acoustic model is a tonal triphone model. The model of a final in the tonal triphone set depends not only on the initials to its left and right but also on the tones of the finals two positions to its left and two positions to its right, as well as on its own tone. Tone information is therefore added to the tree-organized dictionary: the tone of each final is attached to the initial of the same syllable, so that only one layer of nodes needs to be pre-expanded during the search;

A multi-threshold beam search is used. Set n probability thresholds P_0, P_1, P_2, ..., P_n, where P_i is the i-th threshold; the algorithm is as follows:

A) determine the score interval into which each path with score P falls: if P_i ≥ P ≥ P_{i+1}, the path is considered to fall in the i-th interval, and the path counter C_i of that interval is incremented by 1;

B) for i = 1, ..., n, compute the cumulative number of paths S_i in intervals 1 through i:

S_i = \sum_{j = 1}^{i} C_j

where C_j is the path counter of the j-th interval; find the smallest i satisfying S_i ≥ CountThread, and take P_i as the pruning threshold, where CountThread is the number of active paths to be maintained according to system requirements;

C) prune the paths according to the threshold P_i;

Trigram language model prediction is used. When node n is expanded to node m, the prediction is computed as:

\mathrm{Prob} (\mathrm{Path}_j) = [\mathrm{Prob} (\mathrm{Path}_i) - \mathrm{ProbLm} (W_1, W_2, n)] + \mathrm{ProbLm} (W_1, W_2, m)

In this formula, ProbLm(W_1, W_2, n) and ProbLm(W_1, W_2, m) denote the maximum trigram probability of W_1, W_2 followed by any word reachable from node n or node m respectively, i.e. ProbLm(W_1, W_2, n) = MaxProb(W_1, W_2, W_3) or ProbLm(W_1, W_2, m) = MaxProb(W_1, W_2, W_3), where W_3 ranges over all words reachable from node n or m, and Prob(Path_j) and Prob(Path_i) denote the probability scores of paths j and i;

A probability retrieval buffer ProbBuffer is set up, each entry of which consists of four elements: {W_1, W_2, W_3, l, MaxLm}, where W_1, W_2, W_3 form a word trigram, l is a node number, and MaxLm is the buffered probability to be retrieved. During the search, when the maximum trigram probability of the words W_1, W_2 with all words reachable from node m is needed, the buffer is consulted first:

1) if an entry for W_1, W_2, m is found in the buffer, MaxLm is output directly;

2) if no entry for W_1, W_2, m is found in the buffer, but an entry for W_1, W_2, n is found, where n is the parent node of m, and W_mx ≤ W_3 ≤ W_my holds, MaxLm is output directly, where W_mx and W_my have the meaning explained in feature 2) of the dictionary structure above;

3) otherwise, the probability is retrieved directly.

A probability retrieval buffer is also set up, each entry of which consists of four elements: ProbBuffer = {W_1, W_2, W_3, n, MaxLm}. During the search, when the function ProbBLm(W_1, W_2, m) needs to be called, the buffer is consulted first:

1) if W_1, W_2, m is found in the buffer, MaxLm is output directly;

2) if W_1, W_2, m is not found in the buffer, but W_1, W_2, n is found, where n is the parent node of m, and W_mx ≤ W_3 ≤ W_my holds, MaxLm is output directly;

3) otherwise, ProbBLm is retrieved directly from the language model.

Description of drawings

To further explain the technical content of the invention, it is described in detail below with reference to embodiments and the accompanying drawings, in which:

Fig. 1 is a block diagram of a speech recognition system;

Fig. 2 shows the relationship between N-best and the word graph;

Fig. 3 shows the integrated prediction search framework;

Fig. 4 is a schematic diagram of the tree representation of the dictionary;

Fig. 5 shows how tone information is marked in the tree;

Fig. 6 is a schematic diagram of multi-threshold pruning;

Fig. 7 is a schematic diagram of the tree representation of the dictionary used for prediction;

Fig. 8 is a schematic diagram of the retrieval.

Embodiment

The concept of the invention is illustrated in Fig. 3. The core of this search framework is still a time-synchronous frame search algorithm, whose inputs are the search dictionary, the statistical trigram language model, the tonal triphone model, word prediction, and the stream of speech recognition features. Compared with the framework of Fig. 2, Fig. 3 produces no intermediate output. The framework of Fig. 2 also needs several sets of acoustic and language models, with simple models used first and more complex models afterwards, whereas Fig. 3 uses the most advanced acoustic and language models directly. In this framework, the acoustic model used as input is the tonal triphone model itself. The technical essentials of the invention are as follows:

1. Multi-threshold pruning strategy for the search

Multi-threshold pruning is illustrated in Fig. 6. Set n thresholds P_0, P_1, P_2, ..., P_n, where P_0 is the highest path probability at the current time point. The algorithm is as follows:

A) determine the score interval into which each path with score P falls: if P_i ≥ P ≥ P_{i+1}, the path is considered to fall in the i-th interval, and the path counter C_i of that interval is incremented by 1;

B) for i = 1, ..., n, compute the cumulative number of paths S_i in intervals 1 through i:

S_i = \sum_{j = 1}^{i} C_j

where C_j is the path counter of the j-th interval. Find the smallest i satisfying S_i ≥ CountThread; the pruning threshold is then P_i. CountThread is the number of active paths to be maintained according to system requirements.

C) prune the paths according to the threshold P_i.

This procedure makes it possible to control the number of paths to be expanded more accurately. The empirical threshold values can be obtained from statistics.
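Steps A) to C) can be sketched as follows. The patent does not specify how the thresholds are chosen, so this sketch assumes they are fixed fractions `ratios` of the best score P_0; it also fixes one possible indexing of the intervals (interval i holds the scores that reach P_i but not P_{i-1}). The function and parameter names are hypothetical. The idea illustrated is that one counting pass over the paths stands in for a sort.

```python
def multi_threshold_prune(scores, ratios, count_limit):
    """Choose a pruning threshold from a fixed ladder and apply it.

    scores      -- probability scores of the active paths
    ratios      -- descending factors in (0, 1); threshold P_i = ratios[i-1] * P_0
                   (an assumed parameterization, P_0 being the best score)
    count_limit -- target number of active paths (CountThread in the text)
    Returns (chosen threshold, surviving scores).
    """
    p0 = max(scores)
    thresholds = [r * p0 for r in ratios]      # P_1 ... P_n
    counts = [0] * len(thresholds)             # C_1 ... C_n
    for s in scores:                           # step A: count paths per interval
        for i, p in enumerate(thresholds):
            if s >= p:
                counts[i] += 1
                break                          # paths below P_n are never counted
    chosen = thresholds[-1]                    # loosest threshold as fallback
    cumulative = 0
    for p, c in zip(thresholds, counts):       # step B: S_i = C_1 + ... + C_i
        cumulative += c
        if cumulative >= count_limit:
            chosen = p
            break
    return chosen, [s for s in scores if s >= chosen]   # step C


scores = [1.0, 0.9, 0.6, 0.45, 0.3, 0.15, 0.05]
ratios = [0.8, 0.5, 0.2, 0.1]
threshold_3, kept_3 = multi_threshold_prune(scores, ratios, count_limit=3)
threshold_4, kept_4 = multi_threshold_prune(scores, ratios, count_limit=4)
```

Because the threshold can only take one of the ladder values, the number of surviving paths is at least `count_limit` when enough paths exist, but may overshoot it, as the second call shows.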

2. Dictionary organization

The knowledge in the language model must be fully used to predict path scores. The invention therefore adds one further piece of information to each node: the set of leaf nodes, i.e. the set of words, that can be reached by expansion from a given node n. For nodes in the first few layers near the root, the set of leaf nodes and words that each node can reach is quite large, so recording the set directly is impractical. In the invention, the words of the original dictionary are renumbered so that the numbering follows the order of the words at the leaf nodes of the dictionary tree. With this ordering, the set can be described very compactly: it is enough to record the numbers of the first and last words in the set reachable from the node. Each node therefore carries a W_x and a W_y, as shown in Fig. 7. Clearly, if node m is obtained by expanding node n, then necessarily:

W_{nx} \le W_{mx} \le W_{my} \le W_{ny}

(formula 5)

where W_mx and W_my bound the range of words reachable from node m, and W_nx and W_ny bound the range of words reachable from node n.
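One way to obtain the numbering and the per-node ranges W_x, W_y is a single depth-first walk over the tree. The following sketch is illustrative only: the class and function names and the toy lexicon are hypothetical, and children are visited in sorted order simply to make the numbering deterministic. It relies on the property that the range stored on a node encloses the ranges of the nodes expanded from it.

```python
class Node:
    def __init__(self):
        self.children = {}   # sub-word unit -> child Node
        self.words = []      # words whose pronunciation ends at this node
        self.wx = None       # number of the first word reachable from this node
        self.wy = None       # number of the last word reachable from this node


def build_tree(lexicon):
    root = Node()
    for word, units in lexicon:
        node = root
        for unit in units:
            node = node.children.setdefault(unit, Node())
        node.words.append(word)
    return root


def number_words(node, numbering):
    """Number the words in tree order and store the range [wx, wy] on each node."""
    node.wx = len(numbering)
    for word in node.words:
        numbering[word] = len(numbering)
    for unit in sorted(node.children):
        number_words(node.children[unit], numbering)
    node.wy = len(numbering) - 1
    return numbering


# Hypothetical lexicon with placeholder word labels.
LEXICON = [
    ("word_A", ("zh", "ong1")),
    ("word_B", ("zh", "ong1")),
    ("word_C", ("zh", "ang1")),
    ("word_D", ("b", "a1")),
]
root = build_tree(LEXICON)
numbering = number_words(root, {})
zh = root.children["zh"]
ong1 = zh.children["ong1"]
```

With this numbering, every subtree covers a contiguous block of word numbers, so two integers per node are enough to describe the set of reachable words.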

As noted above, the model chosen for a final depends on the initials to its left and right, on the tones of the finals two positions to its left and two positions to its right, and on its own tone. When a final is expanded during the search, its left context is known but its right context is not, so nodes must be expanded in advance. By the above dependency, the next two layers of nodes would in fact have to be pre-expanded, which causes the number of paths to grow rapidly. In this word tree structure, the tone of each final is attached to the initial of the same syllable, so that only one layer of nodes needs to be pre-expanded during the search, as shown in Fig. 5.

3. Language model prediction

The improved algorithm adds effective linguistic knowledge at every point where a tree node is expanded during the search, which raises the recognition rate and minimizes the search effort and the time it consumes. In this case the probability score for expanding from path i to path j becomes:

\mathrm{Prob} (\mathrm{Path}_j) = [\mathrm{Prob} (\mathrm{Path}_i) - \mathrm{ProbLm} (W_1, W_2, n)] + \mathrm{ProbLm} (W_1, W_2, m)

(formula 6)

In formula 6, ProbLm(W_1, W_2, n) and ProbLm(W_1, W_2, m) denote the maximum trigram probability of W_1, W_2 followed by any word reachable from node n or node m respectively, i.e.

\mathrm{ProbLm} (W_1, W_2, n) = \mathrm{MaxProb} (W_1, W_2, W_3) \quad \text{or} \quad \mathrm{ProbLm} (W_1, W_2, m) = \mathrm{MaxProb} (W_1, W_2, W_3)

(formula 7)

Here W_3 ranges over all words reachable from node n or m, and Prob(Path_j) and Prob(Path_i) denote the probability scores of paths j and i. The point of formula 7 is that each time a new node is expanded, the closest available language probability is added to the path. Language probability information thus enters the search early, which improves both search speed and recognition accuracy.
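Formulas 6 and 7 can be sketched as below. The additive form of formula 6 suggests that scores are log-probabilities, and the sketch assumes so. The trigram table, the floor value for unseen trigrams, the word ranges, and all function names are hypothetical; a real system would query its own language model instead of scanning a Python dictionary.

```python
import math


def prob_lm(trigram_logp, w1, w2, word_range, floor=math.log(1e-6)):
    """Formula 7: best trigram log-probability over the words W3 in [wx, wy].

    Returns the pair (best score, word number that attains it).
    """
    wx, wy = word_range
    return max((trigram_logp.get((w1, w2, w3), floor), w3)
               for w3 in range(wx, wy + 1))


def expand(path_score, trigram_logp, w1, w2, range_n, range_m):
    """Formula 6: replace the look-ahead of node n by that of its successor m."""
    lm_n, _ = prob_lm(trigram_logp, w1, w2, range_n)
    lm_m, _ = prob_lm(trigram_logp, w1, w2, range_m)
    return (path_score - lm_n) + lm_m


# Hypothetical trigram probabilities for history (7, 8) and words 0..3.
trigram_logp = {(7, 8, w3): math.log(p)
                for w3, p in enumerate([0.1, 0.4, 0.2, 0.05])}

# Node n reaches words 0..3; one successor reaches 0..1, another 2..3.
same = expand(-12.0, trigram_logp, 7, 8, (0, 3), (0, 1))
lower = expand(-12.0, trigram_logp, 7, 8, (0, 3), (2, 3))
```

The score is unchanged when the best word of node n is still reachable from the successor, and drops as soon as the expansion rules that word out, so weak paths can be pruned before a leaf node is reached.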

4. Retrieval of prediction probabilities

In continuous speech recognition, retrieving ProbLm(W_1, W_2, m) as described above takes roughly 20% of the time. Extensive observation shows that, within a given period, the function ProbLm(W_1, W_2, m) is very likely to be called repeatedly with the same parameters W_1, W_2, m. Fig. 8 explains this repetition. Suppose that at some time point t there are 5 paths that are in the same word tree node n and have the same history words w1, w2, and differ only in their HMM state, being in states 0, 1, 2, 3, and 4 respectively. At time t, the path in state 4 will expand to the next node m, which requires retrieving ProbLm(w1, w2, m). At time t+1, the path that was in state s=3 will move to state 4 and then expand to node m, which requires retrieving the same probability. At time t+2, the path originally in state 2 will likewise expand to node m.

It can also be observed that every path is expanded from a parent node, and the word that gives the maximum trigram probability from parent node n may well fall within the range of words reachable from node m. In that case:

ProbLm(W_1, W_2, m) = ProbLm(W_1, W_2, n), if W_nx ≤ W_mx ≤ W_3 ≤ W_my ≤ W_ny

Based on these observations, a probability buffer ProbBuffer is set up, each entry of which consists of four elements: {W_1, W_2, W_3, l, MaxLm}, where W_1, W_2, W_3 form a word trigram, l is a node number, and MaxLm is the buffered probability to be retrieved. During the search, when the maximum trigram probability of the words W_1, W_2 with all words reachable from node m is needed, the buffer is consulted first:

1) if an entry for W_1, W_2, m is found in the buffer, MaxLm is output directly;

2) if no entry for W_1, W_2, m is found in the buffer, but an entry for W_1, W_2, n is found, where n is the parent node of m, and W_mx ≤ W_3 ≤ W_my holds, MaxLm is output directly, where W_mx and W_my have the meaning explained in Section 2 above;

3) otherwise, the probability is retrieved directly.
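The three-case lookup can be sketched as a small cache in front of the direct retrieval. This is an illustration under stated assumptions rather than the patent's implementation: the class `LookaheadCache`, the toy trigram table, and the node labels are hypothetical, and W_3 in each entry is assumed to be the word that attains the maximum, which is what makes the range test in case 2 meaningful.

```python
class LookaheadCache:
    """Cache for the maximum trigram probability seen from a tree node."""

    def __init__(self, lookup):
        self.lookup = lookup   # direct retrieval: (w1, w2, (wx, wy)) -> (max_lm, w3)
        self.buffer = {}       # (w1, w2, node) -> (w3, max_lm)

    def get(self, w1, w2, node, parent, ranges):
        key = (w1, w2, node)
        if key in self.buffer:                      # case 1: entry for this node exists
            return self.buffer[key][1]
        wx, wy = ranges[node]
        cached = self.buffer.get((w1, w2, parent))
        if cached is not None and wx <= cached[0] <= wy:
            self.buffer[key] = cached               # case 2: parent's best word is reachable
            return cached[1]
        max_lm, w3 = self.lookup(w1, w2, (wx, wy))  # case 3: direct retrieval
        self.buffer[key] = (w3, max_lm)
        return max_lm


# Toy data: four words numbered 0..3, all with history (7, 8).
TRIGRAM = {(7, 8, 0): 0.1, (7, 8, 1): 0.4, (7, 8, 2): 0.2, (7, 8, 3): 0.05}
RANGES = {"n": (0, 3), "m": (0, 1), "k": (2, 3)}   # m and k are successors of n
direct_calls = []


def direct_lookup(w1, w2, word_range):
    """Scan the toy table; records each call so cache hits can be checked."""
    direct_calls.append(word_range)
    wx, wy = word_range
    return max((TRIGRAM.get((w1, w2, w3), 0.0), w3) for w3 in range(wx, wy + 1))


cache = LookaheadCache(direct_lookup)
results = [
    cache.get(7, 8, "n", None, RANGES),   # nothing cached yet: direct retrieval
    cache.get(7, 8, "m", "n", RANGES),    # reuses the entry of parent n
    cache.get(7, 8, "k", "n", RANGES),    # best word of n lies outside k's range
    cache.get(7, 8, "k", "n", RANGES),    # exact entry is now cached
]
```

Storing the parent's entry under the child's key in case 2 is a design choice of this sketch; it turns later requests for the same node into case 1 hits.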

The advantages of the invention are as follows. It addresses the shortcomings of the search algorithms described above and, in particular, the need in Chinese to integrate the additional suprasegmental information of tone. The inputs required for continuous speech recognition, namely the acoustic feature sequence of the speech, the dictionary, the acoustic model, and the language model, are processed in a single pass. This recognition method, which yields the word sequence that is optimal in the probabilistic sense, makes full and effective use of all available knowledge sources, thereby minimizing search errors and improving search efficiency.

It should be emphasized that, although the invention has been described as implemented within a single-pass search framework for Chinese continuous speech recognition, its principles and algorithms are applicable to the search problem in any speech recognition task.

Application examples of the present invention:

1. Effect of integrated prediction decoding in continuous speech recognition

This decoding method and the related algorithms were first implemented and tested in a large-vocabulary continuous speech recognition system. The overall system comprises training speech, text corpus collection and processing, acoustic model training, language model training, integrated prediction decoding, and a speech test database. The whole system is implemented on a PC platform with a standard sound card and an external microphone.

The test set is the national "863" standard test database, which consists of speech from 6 male and 6 female speakers, each reading 40 sentences, for 480 sentences in total. The sentences were selected from the People's Daily. With this search algorithm the recognition rate improves by more than 6%, while recognition speed remains roughly the same.

2. Application in interactive systems

Completed applications include the travel information consultation system LoadStar, a hotel reservation system, and a restaurant translation assistance system. By replacing the vocabulary and the language model, the invention can be ported to a different task domain very easily, which also shows that the invention is independent of the specific application, vocabulary, and language model. Each system consists of 5 modules: speech recognition, language understanding, dialogue management, language response generation, and speech synthesis. The speech recognition module includes the algorithm implemented according to the invention.

Claims

1. An integrated prediction search method for Chinese continuous speech recognition, characterized in that a tonal triphone model and a word trigram statistical language model are searched together in a single integrated pass, and the language model is predicted during decoding; the core search algorithm is a time-frame-synchronous beam search with multi-threshold pruning, and during the search the special structure of the dictionary and the trigram statistical model are used to predict the language model;

3) if node m is obtained by expanding node n, then necessarily:

S_i = \sum_{j = 1}^{i} C_j

C) prune the paths according to the threshold P_i;

1) if an entry for W_1, W_2, m is found in the buffer, MaxLm is output directly;

3) otherwise, the probability is retrieved directly.

2. The integrated prediction search method for Chinese continuous speech recognition according to claim 1, characterized in that a probability retrieval buffer is also set up, each entry of which consists of four elements: ProbBuffer = {W_1, W_2, W_3, n, MaxLm}; during the search, when the function ProbBLm(W_1, W_2, m) needs to be called, the buffer is consulted first:

1) if W_1, W_2, m is found in the buffer, MaxLm is output directly;

3) otherwise, ProbBLm is retrieved directly from the language model.