CN1398395A - Global approach for segmenting characters into words - Google Patents
- Publication number
- CN1398395A (application CN99817082.8A)
- Authority
- CN
- China
- Prior art keywords
- probability
- path
- word
- character
- segmentation path
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/083—Recognition networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
Abstract
In some embodiments, the invention includes a method. The method involves using a vocabulary to create a path list of segmentation paths of characters. A probability of a first segmentation path is determined and that path is designated the best segmentation path. The probability of an additional one of the segmentation paths is determined and compared with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probabilities of all remaining segmentation paths have been determined and compared with the probability of the best segmentation path. In some embodiments, the invention is an apparatus including a computer-readable medium that carries out such a method.
Description
Technical field
The present invention relates to speech recognition systems and, more particularly, to segmenting characters into words within a speech recognition system.
Background
One component of a speech recognizer is a language model. A common way of capturing the syntactic structure of a given language is to use conditional probabilities to capture the sequential information embedded in the word strings of its sentences. For example, if the current word is W1, a language model can be constructed that represents the probability that some other word W2, W3, ... Wn follows W1. These probabilities can be expressed as follows: P21 is the probability that word W2 follows word W1, where P21 = P(W2|W1). In this notation, P31 is the probability that W3 follows W1, P41 is the probability that W4 follows W1, and so on, up to Pn1, the probability that Wn follows W1. The maximum of P21, P31, ... Pn1 can be determined and used in the language model. The example above uses bigram probabilities, although trigram conditional probabilities can also be computed.
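The conditional probabilities described above can be estimated from counts over a corpus. The sketch below is a minimal illustration (the function name and toy corpus are our own, not from the patent), computing P(W2|W1) as count(W1 W2) / count(W1):

```python
from collections import Counter

def bigram_probabilities(corpus_sentences):
    """Estimate P(w2 | w1) for every adjacent word pair in a corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus_sentences:
        for w1, w2 in zip(sentence, sentence[1:]):
            bigrams[(w1, w2)] += 1
        unigrams.update(sentence)
    # Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = bigram_probabilities([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(probs[("the", "cat")])  # 0.5
```

From such a table, the maximum over P21, P31, ... Pn1 for a given W1 is simply the largest entry whose first element is W1.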
A language model is often generated by examining a body of text (such as newspapers) and determining the conditional probabilities of words in a vocabulary with respect to other words in that vocabulary.
In some languages, such as Chinese and Japanese, words are written as one or more characters, for example the Hanzi of Chinese and the Kanji of Japanese. A sentence is made up of a string of characters in which the words are implicit, because there are no spaces between adjacent words. A particular character may be a word by itself, or it may combine with the character before it or after it (or possibly both) to form a word. The meaning of the sentence changes depending on how the characters are combined into or separated from words. In written form, however, there are no spaces, so it is not visually apparent whether a particular character is a word by itself or forms a word with one or more other characters; which word a particular character belongs to must be understood from context. To apply statistical methods to a language model, the words are made explicit by placing delimiters at the word boundaries.
Segmenting characters into words has traditionally been done with a "greedy algorithm". The greedy algorithm includes the following steps:
(1) Starting from the beginning of the sentence being processed, exhaustively list all words in the vocabulary that match the initial portion of the character string.
(2) Pick the longest word (that is, the word with the most characters), place a delimiter after the matched substring in the sentence, treat the remaining character string as a new sentence, and repeat step (1) until all characters in the sentence have been processed.
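Steps (1) and (2) above can be sketched as follows (the identifiers and toy vocabulary are illustrative, not from the patent):

```python
def greedy_segment(text, vocabulary, max_word_len=8):
    """Longest-match-first segmentation: at each position, take the longest
    vocabulary word that matches the head of the remaining string."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 0, -1):
            # A single character is always accepted as a word of its own.
            if n == 1 or text[i:i + n] in vocabulary:
                words.append(text[i:i + n])
                i += n
                break
    return words

# The greedy choice "abc" blocks the arguably better split "ab" + "cd":
print(greedy_segment("abcd", {"ab", "abc", "cd"}))  # ['abc', 'd']
```

The toy run shows exactly the local-choice problem the next paragraph describes: picking the longest match at one position can leave a poor remainder.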
From a global viewpoint, the greedy algorithm does not make the best selection. In fact, the combination it selects may be neither optimal nor syntactically correct. As T. Cormen et al. put it in Introduction to Algorithms (The MIT Press, 1990), page 329: "A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution."
Summary of the invention
In some embodiments, the present invention includes a method. The method uses a vocabulary to create a path list of segmentation paths of the characters. The probability of a first segmentation path is determined and that path is designated the best segmentation path. The probability of another segmentation path is determined and compared with the probability of the best segmentation path. If the probability of the other segmentation path exceeds that of the best segmentation path, the other segmentation path is designated the best segmentation path. This is repeated until the probabilities of all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
In some embodiments, the present invention is an apparatus including a computer-readable medium that carries out such a method. In still other embodiments, the present invention is a computer system.
Additional embodiments and the claims are presented below.
Brief Description Of Drawings
The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only.
Fig. 1 is a high-level schematic block diagram representation of a computer system with which some embodiments of the present invention may be used.
Fig. 2 is a high-level schematic representation of a handheld computer system with which some embodiments of the present invention may be used.
Embodiment
The present invention relates to a system and method for segmenting characters into words. That is, the invention concerns determining which word a character should belong to. The invention has particular application to languages, such as Chinese and Japanese, in which word divisions are not indicated by spaces between characters, but the invention is not limited to such uses. The invention is designed to make a preferable word segmentation of any given sentence. A language model generated in this way is better than one obtained with the traditional greedy-algorithm method described above. A better language model leads to better recognition accuracy, because it better describes the language in terms of its word strings.
In some embodiments, the present invention performs the segmentation using a dynamic-programming algorithm equipped with a statistical language model. A dynamic-programming algorithm can be carried out in a variety of ways; one example is as follows. First, an n-gram language model is computed over the corpus (that is, the characters that are to be segmented into words) as segmented by the traditional greedy algorithm. Then, the Viterbi algorithm is used to re-segment each sentence. The Viterbi algorithm is a form of dynamic programming that can be used for global optimization; see T. Cormen et al., Introduction to Algorithms (The MIT Press, 1990), pages 301-328. The Viterbi algorithm used can be described by the following equation (1):

    P(w_i) = max[ P(w_{i-1}) x prob(w_i | w_{i-1}) ]    (1)

In equation (1), P is a probability and "prob" embodies the language model. Here w_i is the i-th word, w_{i-1} is the word immediately preceding w_i, P(w_{i-1}) is the probability of the word sequence ending at word w_{i-1}, and prob(w_i | w_{i-1}) is the conditional probability that word w_i occurs given that word w_{i-1} has occurred. Equation (1) involves finding the words w_i that maximize it. By solving equation (1), the resulting word sequence (w0 w1 ... wN) is guaranteed to be the best segmentation in the maximum-likelihood sense. In some embodiments, when i = N, that is, when the end of the sentence is reached, a global maximum is attained.
Equation (1) is a bigram form; however, other forms, such as trigram or unigram forms, can also be used if they are present in the language model. Other techniques, such as back-off weighting, can also be used.
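The recursion of equation (1) can be sketched as a compact dynamic program. This is an illustrative reading, not the patent's implementation: the data structures, the sentence-start marker, the back-off floor for unseen bigrams, and the toy inputs are all our own assumptions.

```python
import math

def viterbi_segment(text, vocabulary, bigram_prob, max_word_len=8):
    """For each prefix length j, keep the best (log-probability, word list)
    segmentation of text[:j], extending shorter prefixes by one word,
    as in the Viterbi recursion of equation (1)."""
    best = {0: (0.0, ["<s>"])}  # "<s>" marks the sentence start
    for j in range(1, len(text) + 1):
        candidates = []
        for n in range(1, min(max_word_len, j) + 1):
            word = text[j - n:j]
            if n > 1 and word not in vocabulary:
                continue  # single characters are always allowed as words
            if j - n not in best:
                continue
            logp, words = best[j - n]
            # Unseen bigrams get a small floor probability (a crude back-off).
            p = bigram_prob.get((words[-1], word), 1e-6)
            candidates.append((logp + math.log(p), words + [word]))
        if candidates:
            best[j] = max(candidates)
    return best[len(text)][1][1:]  # drop the start marker

probs = {("<s>", "ab"): 0.5, ("ab", "cd"): 0.9, ("<s>", "abc"): 0.5}
print(viterbi_segment("abcd", {"ab", "abc", "cd"}, probs))  # ['ab', 'cd']
```

Unlike the greedy algorithm, which would pick "abc" first, the dynamic program recovers the globally better split "ab" + "cd".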
As mentioned above, in some languages each character may itself be a word. The present invention, however, concerns determining whether a character is better combined with other characters to form a word or left as a word on its own. A word composed of multiple characters may also be called a term or a phrase.
A version of the greedy algorithm is given below in pseudocode form:

    read vocabulary;        // the vocabulary is the list of possible words
    open corpus;            // the corpus contains the characters to be segmented into words
    while (not at end of corpus)
    {
        read a line from the corpus into the line buffer;
        // the line buffer is simply a group of memory locations and is not
        // limited to any particular form
        while (line buffer is not empty)
        {
            find the longest word in the vocabulary matching the head of the line buffer;
            output that word and a word separator;
            remove the matched head from the line buffer;
        }
        output a line separator;
    }
    close corpus;
In some embodiments according to the present invention, a segmentation algorithm using a language model includes the following steps, given in pseudocode form:

    read language model;    // the language model is loaded into memory or otherwise made available
    read vocabulary;
    open corpus;
    while (not at end of corpus)
    {
        read a line from the corpus into the line buffer;
        // the number of characters in a line can vary with the embodiment;
        // a line can be a sentence
        using the vocabulary, generate the path list containing all possible segmentation paths;
        // a segmentation path is one possible segmentation of the characters;
        // the paths can be stored in different forms, for example a list or a tree structure
        find the greedy segmentation path and save it as the best path;
        // any of several greedy algorithms can be used, such as the one given above;
        // in this embodiment the greedy segmentation path is initially treated as
        // the best path, but other initial paths can be used
        compute the probability of this path using the language model, and set the
        maximum probability to this value;
        // the language model specifies the probability that a word occurs and the
        // probability that one word follows another; equation (1) or another
        // formula can be used to compute the probability
        while (path list is not empty)
        {
            select a path from the path list and make it the current path;
            compute the probability of the current path using the language model;
            if (probability of current path > maximum probability)
            {
                maximum probability = probability of current path;
                save current path as best path;
            }
            remove current path from the path list;
        }
        output best path;
    }
    close corpus;
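The path-list loop above can be sketched directly: enumerate every segmentation path the vocabulary allows, then keep the highest-scoring one. This is an illustrative reading of the method (the names and scoring function are our own; note that an exhaustive path list grows quickly with line length, which is why the Viterbi formulation is attractive):

```python
def all_paths(text, vocabulary, max_word_len=8):
    """Every segmentation path the vocabulary allows; a single character
    is always permitted as a word by itself."""
    if not text:
        return [[]]
    paths = []
    for n in range(1, min(max_word_len, len(text)) + 1):
        word = text[:n]
        if n == 1 or word in vocabulary:
            for rest in all_paths(text[n:], vocabulary):
                paths.append([word] + rest)
    return paths

def best_path(paths, score):
    """Designate the first path best, then compare each remaining path's
    score against the best so far, as in the pseudocode's inner loop."""
    best, best_score = paths[0], score(paths[0])
    for path in paths[1:]:
        s = score(path)
        if s > best_score:
            best, best_score = path, s
    return best

paths = all_paths("abcd", {"ab", "abc", "cd"})
print(["ab", "cd"] in paths and ["abc", "d"] in paths)  # True
```

In practice `score` would be the language-model probability of the path, computed with equation (1) or a similar formula.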
An example of the algorithm is given below in connection with the following Chinese words.
Original text:
Segmentation result using the greedy method:
Segmentation result using the language model:
Example 1.
When correctly segmented, the sentence means "there are ways and strength to solve the problem." The present invention segmented this sentence successfully; the traditional method did not.
In Example 1, the original text is considered as the following eight characters in order: C1, C2, C3, C4, C5, C6, C7, and C8. From the original text it is not visually apparent how the characters group to form words. Table 1 below gives two possible ways in which the characters group into five words W1-W5.
Table 1:
Word | Characters in the word, per the prior-art greedy algorithm | Characters in the word, per the present invention
---|---|---
W1 | C1 | C1
W2 | C2C3 | C2C3
W3 | C4C5 | C4
W4 | C6 | C5C6
W5 | C7C8 | C7C8
The greedy segmentation path is produced with a greedy algorithm as follows. The longest word in the vocabulary beginning with character C1 that matches consecutive characters in the corpus is the word consisting of C1 alone; in other words, C1C2 is not a word in the vocabulary. Word W1 is therefore character C1. In some embodiments, W1 leaves the line buffer and the next character becomes the head of the line, although this is an implementation detail that is not required. In this example, the next character is C2. The longest matching vocabulary word beginning with C2 is the word made of characters C2C3; that is, C2C3 is in the vocabulary but C2C3C4 is not. Word W2 is therefore C2C3. The longest matching vocabulary word beginning with C4 is C4C5, so word W3 is C4C5. The longest matching vocabulary word beginning with C6 is C6 alone, so word W4 is C6. The longest matching vocabulary word beginning with C7 is C7C8, so word W5 is C7C8.
The probability of this greedy segmentation path is computed. For words W1 and W2 and characters C1, C2, and C3, the only segmentation path the vocabulary allows is the one selected by the greedy algorithm. One way to handle this situation is not to compute another probability when no other vocabulary-allowed path exists. Another way is to recompute the probability of the same path; it will merely be found identical, so the current path does not replace the maximum probability.
For words W3 and W4, however, there are two possible paths. The first, selected by the greedy algorithm, has W3 = C4C5 and W4 = C6. The other segmentation path the vocabulary allows has W3 = C4 and W4 = C5C6. In this example, assume the probability of C4 followed by the combination C5C6 is greater than that of the combination C4C5 followed by C6. (W5 is the same in either case.) Then, under equation (1), the probability of the current path can be greater than that of the greedy segmentation path, and the current path replaces the greedy path. Note the following possibility: suppose the combination C4C5 has a higher probability than C4 by itself. On that single piece of information, the greedy segmentation path would be selected. Yet that would not lead to the better global solution, because C4 followed by C5C6 is more probable than C4C5 followed by C6.
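The comparison can be made concrete with numbers. The probability values below are invented purely for illustration (the real values would come from the trained language model), but they reproduce the situation described: C4C5 is locally likelier, yet the other path wins globally.

```python
import math

# Hypothetical bigram probabilities for the two Table 1 groupings.
prob = {
    ("W2", "C4C5"): 0.04, ("C4C5", "C6"): 0.01,  # greedy path
    ("W2", "C4"):   0.03, ("C4", "C5C6"): 0.20,  # alternative path
}

def path_log_prob(bigrams):
    """Sum of log bigram probabilities along a path (log of the product)."""
    return sum(math.log(prob[b]) for b in bigrams)

greedy_lp = path_log_prob([("W2", "C4C5"), ("C4C5", "C6")])
other_lp = path_log_prob([("W2", "C4"), ("C4", "C5C6")])

# C4C5 is locally likelier after W2 (0.04 > 0.03), yet the path through
# C4 + C5C6 scores higher overall:
print(other_lp > greedy_lp)  # True
```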
A line can be a sentence. As used here, the term "sentence" means a group of consecutive words ending with a symbol such as a period. Different embodiments can consider different groups of characters in a segmentation path. For example, a segmentation path can consider all the characters in a sentence. A segmentation path can instead consider a moving window of characters without regard to sentence endings, noting only that the language model does not allow the last character of one sentence to combine with the first character of the next sentence. The window may be a set number of characters. If the last character of the previous path is not within a word, a new segmentation path starting from it may contain X characters. Other possibilities also exist.
A variety of computer systems can be used for training and for speech recognition. Merely as an example, Fig. 1 shows a high-level schematic of a computer system 10 that includes a processor 14, memory 16, and an input/output and control block 18. Memory 16 may include a line buffer 22. A line buffer is simply a group of memory locations and need not have any particular feature; for example, it need not occupy contiguous memory locations. There may be a substantial amount of memory in processor 14, and memory 16 may represent both memory off the chip of processor 14 and memory partly on and partly off that chip. (Alternatively, memory 16 could be entirely on the chip of processor 14.) In some embodiments, a line buffer 24 is in processor 14, although a line buffer is not required to be in processor 14. Moreover, not every embodiment of the invention has a line buffer, and the segmentation paths need not be held in a line buffer. At least some of the input/output and control block 18 may be on the same chip as processor 14, or it may be on another chip. A microphone 26, monitor 30, additional memory 34, input devices (such as a keyboard and mouse 38), a network connection 42, and a speaker 44 may interact with input/output and control block 18. Memory 34 represents a variety of memories, such as a hard drive and CD-ROM or DVD discs. These include computer-readable media that can hold instructions which, when executed, cause some embodiments of the invention to occur. It is emphasized that Fig. 1 is merely schematic and the invention is not limited to use with such a computer system. Computer system 10 and the other computer systems used to implement the invention may take a variety of forms, such as desktop, mainframe, and portable computers.
For example, Fig. 2 shows a handheld device 60 with a display 62, which may incorporate some or all of the features of Fig. 1. The handheld device may often interface with another computer system, such as that of Fig. 1. The shapes and relative sizes of the objects in Figs. 1 and 2 are not intended to suggest actual shapes and relative sizes.
Additional information and embodiments
The quality of a language model is traditionally measured by its perplexity, which is an entropy measure of language complexity. For the same training and evaluation body of text, a model with lower perplexity is better than one with higher perplexity. As an experiment, trigram models estimated under the different segmentation methods were evaluated using People's Daily data from 1994 to 1998. The perplexity of the traditional (greedy) method was 182, while the result for an embodiment of the invention was 143. Compared with the prior art, this is a significant improvement in modeling accuracy.
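As a reminder of what the perplexity figures mean: perplexity is the exponential of the negative average per-word log-probability (a standard definition; the sketch below is illustrative and not taken from the patent).

```python
import math

def perplexity(word_log_probs):
    """Perplexity = exp of the negative mean per-word log-probability."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# A model that assigns every word probability 1/182 has perplexity 182,
# i.e. it is as uncertain as a uniform choice among 182 words:
print(round(perplexity([math.log(1 / 182)] * 100)))  # 182
```

On this scale, reducing perplexity from 182 to 143 means the model's average per-word uncertainty shrinks from a 182-way to a 143-way uniform choice.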
Reference in this specification to "an embodiment", "one embodiment", "some embodiments", or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the invention. The various appearances of "an embodiment", "one embodiment", or "some embodiments" are not necessarily all referring to the same embodiments.
If the specification states that a component, feature, structure, or characteristic "may", "might", or "could" be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claims refer to "a" or "an" element, that does not mean there is only one of the element. If the specification or claims refer to "an additional" element, that does not preclude there being more than one of the additional element.
Those skilled in the art, having the benefit of this disclosure, will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Accordingly, it is the following claims, including any amendments thereto, that define the scope of the invention.
Claims (22)
1. A method comprising:
(a) using a vocabulary to create a path list of segmentation paths of characters;
(b) determining a probability of a first one of the segmentation paths and designating it the best segmentation path;
(c) determining a probability of an additional one of the segmentation paths and determining whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designating the additional segmentation path the best segmentation path; and
repeating (c) until the probabilities of all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
2. The method of claim 1, wherein the first segmentation path is obtained through a greedy algorithm.
3. The method of claim 1, wherein the segmentation paths are held in a line buffer and are removed from the line buffer after the corresponding probabilities have been compared.
4. The method of claim 1, wherein the characters included in the segmentation paths are those of a single sentence.
5. The method of claim 1, wherein the characters included in the segmentation paths are within a sliding window.
6. The method of claim 1, wherein the probabilities are determined through use of a language model.
7. The method of claim 1, wherein the probabilities are determined through calculations involving the following equation:
P(w_i) = max[ P(w_{i-1}) x prob(w_i | w_{i-1}) ],
wherein w_i is the i-th word, w_{i-1} is the word immediately preceding w_i, P(w_{i-1}) is the probability of the word sequence ending at word w_{i-1}, and prob(w_i | w_{i-1}) is the conditional probability that word w_i occurs given that word w_{i-1} has occurred.
8. An apparatus comprising:
a computer-readable medium containing instructions which, when executed, cause a computer system to:
(a) use a vocabulary to create a path list of segmentation paths of characters;
(b) determine a probability of a first one of the segmentation paths and designate it the best segmentation path;
(c) determine a probability of an additional one of the segmentation paths and determine whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designate the additional segmentation path the best segmentation path; and
repeat (c) until the probabilities of all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
9. The apparatus of claim 8, wherein the first segmentation path is obtained through a greedy algorithm.
10. The apparatus of claim 8, wherein the segmentation paths are held in a line buffer and are removed from the line buffer after the corresponding probabilities have been compared.
11. The apparatus of claim 8, wherein the characters included in the segmentation paths are those of a single sentence.
12. The apparatus of claim 8, wherein the characters included in the segmentation paths are within a sliding window.
13. The apparatus of claim 8, wherein the probabilities are determined through use of a language model.
14. The apparatus of claim 8, wherein the probabilities are determined through calculations involving the following equation:
P(w_i) = max[ P(w_{i-1}) x prob(w_i | w_{i-1}) ],
wherein w_i is the i-th word, w_{i-1} is the word immediately preceding w_i, P(w_{i-1}) is the probability of the word sequence ending at word w_{i-1}, and prob(w_i | w_{i-1}) is the conditional probability that word w_i occurs given that word w_{i-1} has occurred.
15. The apparatus of claim 8, wherein the apparatus is a disc.
16. A computer system comprising:
a memory to hold a path list of segmentation paths of characters to be segmented into words of a vocabulary; and
a processor to:
(a) determine a probability of a first one of the segmentation paths and designate it the best segmentation path;
(b) determine a probability of an additional one of the segmentation paths and determine whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designate the additional segmentation path the best segmentation path; and
repeat (b) until the probabilities of all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
17. The computer system of claim 16, wherein the first segmentation path is obtained through a greedy algorithm.
18. The computer system of claim 16, wherein the segmentation paths are held in a line buffer and are removed from the line buffer after the corresponding probabilities have been compared.
19. The computer system of claim 16, wherein the characters included in the segmentation paths are those of a single sentence.
20. The computer system of claim 16, wherein the characters included in the segmentation paths are within a sliding window.
21. The computer system of claim 16, wherein the probabilities are determined through use of a language model.
22. The computer system of claim 16, wherein the probabilities are determined through calculations involving the following equation:
P(w_i) = max[ P(w_{i-1}) x prob(w_i | w_{i-1}) ],
wherein w_i is the i-th word, w_{i-1} is the word immediately preceding w_i, P(w_{i-1}) is the probability of the word sequence ending at word w_{i-1}, and prob(w_i | w_{i-1}) is the conditional probability that word w_i occurs given that word w_{i-1} has occurred.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN1999/000213 WO2001048738A1 (en) | 1999-12-23 | 1999-12-23 | A global approach for segmenting characters into words |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1398395A true CN1398395A (en) | 2003-02-19 |
CN1192354C CN1192354C (en) | 2005-03-09 |
Family
ID=4575157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB998170828A Expired - Fee Related CN1192354C (en) | 1999-12-23 | 1999-12-23 | Global approach for segmenting characters into words |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN1192354C (en) |
AU (1) | AU1767200A (en) |
WO (1) | WO2001048738A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609671B (en) * | 2009-07-21 | 2011-09-07 | 北京邮电大学 | Method and device for continuous speech recognition result evaluation |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4059725A (en) * | 1975-03-12 | 1977-11-22 | Nippon Electric Company, Ltd. | Automatic continuous speech recognition system employing dynamic programming |
JPS58132298A (en) * | 1982-02-01 | 1983-08-06 | 日本電気株式会社 | Pattern matching apparatus with window restriction |
JPH02195400A (en) * | 1989-01-24 | 1990-08-01 | Canon Inc | Speech recognition device |
US5706397A (en) * | 1995-10-05 | 1998-01-06 | Apple Computer, Inc. | Speech recognition system with multi-level pruning for acoustic matching |
US5799276A (en) * | 1995-11-07 | 1998-08-25 | Accent Incorporated | Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals |
US5862519A (en) * | 1996-04-02 | 1999-01-19 | T-Netix, Inc. | Blind clustering of data with application to speech processing systems |
JP2001516904A (en) * | 1997-09-18 | 2001-10-02 | シーメンス アクチエンゲゼルシヤフト | How to recognize keywords in spoken language |
US6374220B1 (en) * | 1998-08-05 | 2002-04-16 | Texas Instruments Incorporated | N-best search for continuous speech recognition using viterbi pruning for non-output differentiation states |
-
1999
- 1999-12-23 AU AU17672/00A patent/AU1767200A/en not_active Abandoned
- 1999-12-23 CN CNB998170828A patent/CN1192354C/en not_active Expired - Fee Related
- 1999-12-23 WO PCT/CN1999/000213 patent/WO2001048738A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2001048738A1 (en) | 2001-07-05 |
AU1767200A (en) | 2001-07-09 |
CN1192354C (en) | 2005-03-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20050309 Termination date: 20121223 |