CN100568222C - Divergence elimination language model - Google Patents


Info

Publication number
CN100568222C
CN100568222C · CNB021065306A · CN02106530A
Authority
CN
China
Prior art keywords
character
phrase
language model
word
contextual tagging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB021065306A
Other languages
Chinese (zh)
Other versions
CN1369830A (en)
Inventor
Yunzheng Zhu (朱云正)
F. A. Alleva (F·A·阿列瓦)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/773,342 external-priority patent/US6507453B2/en
Application filed by Microsoft Corp filed Critical Microsoft Corp
Publication of CN1369830A publication Critical patent/CN1369830A/en
Application granted granted Critical
Publication of CN100568222C publication Critical patent/CN100568222C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

A language model for a language processing system, such as a speech recognition system, is constructed as a function of associated characters, word phrases, and context markers. Also described are a method and apparatus for generating a training corpus used to train the language model, and a system or module that uses the disclosed language model.

Description

Divergence elimination language model
Background of the invention
The present invention relates to language modeling. More particularly, the present invention relates to creating and using a language model that minimizes ambiguity, for example during recognition of characters in input speech.
Accurate speech recognition requires more than an acoustic model for selecting the correct word spoken by the user. In other words, if a speech recognizer must choose or determine which word has been uttered, and all candidate words have the same pronunciation, the recognizer clearly cannot perform satisfactorily. A language model provides a method or means of specifying which word sequences in the vocabulary are probable, or more generally, provides information about the relative likelihood of various word sequences.
Speech recognition is often viewed as a form of top-down language processing. Two common types of language processing are "top-down" and "bottom-up". Top-down language processing begins with the largest unit of language to be recognized, such as a sentence, and processes it by classifying it into smaller units, such as phrases, which in turn are divided into still smaller units, such as words. In contrast, bottom-up language processing begins with words and constructs larger phrases and/or sentences from them. Both forms of language processing can benefit from a language model.
One known technique uses an N-gram language model. Because N-grams can be trained on large amounts of data, the N-word dependencies usually capture much of the shallow structure of syntax and semantics. Although an N-gram language model performs well for general dictation, homophones can produce significant errors. A homophone is an element of a language code, such as a character or syllable, that is one of two or more elements pronounced alike but spelled differently. For example, when a user is spelling characters, the speech recognition module may output the wrong character because some characters are pronounced identically. Likewise, characters that sound similar when pronounced (for example, "m" and "n") can cause the recognition module to output errors.
The ambiguity problem is particularly prevalent in languages such as Japanese and Chinese, which are written mainly in ideographic (Chinese-character) writing systems. The characters of these languages are numerous, complex symbols representing both sound and meaning. The characters map onto a limited set of syllables, which in turn produces a large number of homophones and greatly increases the time required to generate a document by dictation. In particular, erroneously recognized homophone characters in the document must be located and replaced with the correct homophone characters.
There is therefore a continuing need for new methods of minimizing the ambiguity that arises when homophones and similar-sounding elements with different meanings are spoken. As technology advances and speech recognition is provided in more applications, a more accurate language model becomes necessary.
Summary of the invention
Speech recognizers commonly use a language model, such as an N-gram language model, to improve accuracy. A first aspect of the present invention comprises generating a language model that is particularly useful when a speaker is identifying a character or characters (for example, syllables), such as when spelling a word. The language model aids disambiguation of homophones and of characters that sound alike. The language model is built from a training corpus of associated elements comprising a character string (which can be a single character), a word phrase containing the character string (which can be a single word), and a context marker. Using a word list or dictionary, the training corpus can be generated automatically by forming, for each word phrase, a sentence or phrase comprising a character string of the word phrase, the context marker, and the word phrase. In a further embodiment, a phrase is generated for each character of the word phrase.
Another aspect of the present invention is a system or module that uses the above-described language model to recognize spoken characters. Using the context marker together with the associated word phrase, the speech recognition module determines that the user is spelling, or indicating the identity of, a character when a character string is spoken. The speech recognition module then outputs only the recognized character, not the context marker or the associated word phrase. In yet another embodiment, the speech recognition module compares the recognized character with the recognized word phrase to verify that the correct character has been recognized. If the recognized character is not present in the recognized word phrase, the output character is a character of the recognized word phrase.
Brief description of the drawings
FIG. 1 is a block diagram of a language processing system.
FIG. 2 is a block diagram of an exemplary computing environment.
FIG. 3 is a block diagram of an exemplary speech recognition system.
FIG. 4 is a flow diagram of a method of the present invention.
FIG. 5 is a block diagram of modules for implementing the method of FIG. 4.
FIG. 6 is a block diagram of a speech recognition module and an optional character verification module.
Detailed description of illustrative embodiments
FIG. 1 shows a language processing system 10 that receives a language input 12 and processes the language input 12 to provide a language output 14. For example, the language processing system 10 can be embodied as a speech recognition system or module that receives, as the language input 12, speech spoken or recorded by a user. The language processing system 10 processes the spoken language and provides as an output the recognized words and/or characters in textual form.
During processing, the speech recognition system or module 10 can access a language model 16 to determine which words, and in particular which homophones or other similar-sounding elements, were spoken. The language model 16 encodes a particular language, such as English, Chinese, or Japanese. In the illustrated embodiment, the language model 16 can be a statistical language model, such as an N-gram language model, a context-free grammar, or a hybrid of the same, all of which are well known in the art. One principal aspect of the present invention is a method of creating and building the language model 16. Another principal aspect is the use of this method in speech recognition.
Before discussing the present invention in detail, an overview of an operating environment may be useful. FIG. 2 and the related discussion provide a brief, general description of a suitable computing environment 20 in which the invention may be implemented. The computing environment 20 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 20 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 20.
The invention is operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. In addition, the invention may be used in telephony systems.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices. Tasks performed by the programs and modules are described below with the aid of the figures.
With reference to FIG. 2, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 30. Components of the computer 30 may include, but are not limited to, a processing unit 40, a system memory 50, and a system bus 41 that couples various system components including the system memory to the processing unit 40. The system bus 41 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, also known as the Mezzanine bus.
The computer 30 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 30 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 30. Communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The system memory 50 includes computer storage media in the form of volatile and/or nonvolatile memory such as read-only memory (ROM) 51 and random access memory (RAM) 52. A basic input/output system 53 (BIOS), containing the basic routines that help to transfer information between elements within the computer 30, such as during start-up, is typically stored in ROM 51. RAM 52 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit 40. By way of example, and not limitation, FIG. 2 illustrates an operating system 54, application programs 55, other program modules 56, and program data 57.
The computer 30 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 2 illustrates a hard disk drive 61 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 71 that reads from or writes to a removable, nonvolatile magnetic disk 72, and an optical disk drive 75 that reads from or writes to a removable, nonvolatile optical disk 76 such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid-state RAM, solid-state ROM, and the like. The hard disk drive 61 is typically connected to the system bus 41 through a non-removable memory interface such as interface 60, and the magnetic disk drive 71 and optical disk drive 75 are typically connected to the system bus 41 by a removable memory interface, such as interface 70.
The drives and their associated computer storage media discussed above and illustrated in FIG. 2 provide storage of computer-readable instructions, data structures, program modules, and other data for the computer 30. In FIG. 2, for example, the hard disk drive 61 is illustrated as storing an operating system 64, application programs 65, other program modules 66, and program data 67. Note that these components can either be the same as or different from the operating system 54, application programs 55, other program modules 56, and program data 57. The operating system 64, application programs 65, other program modules 66, and program data 67 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 30 through input devices such as a keyboard 82, a microphone 83, and a pointing device 81, such as a mouse, trackball, or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 40 through a user input interface 80 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port, or universal serial bus (USB). A monitor 84 or other type of display device is also connected to the system bus 41 via an interface, such as a video interface 85. In addition to the display, computers may also include other peripheral output devices such as speakers 87 and a printer 86, which may be connected through an output peripheral interface 88.
The computer 30 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 94. The remote computer 94 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer 30. The logical connections depicted in FIG. 2 include a local area network (LAN) 91 and a wide area network (WAN) 93, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
When used in a LAN networking environment, the computer 30 is connected to the LAN 91 through a network interface or adapter 90. When used in a WAN networking environment, the computer 30 typically includes a modem 92 or other means for establishing communications over the WAN 93, such as the Internet. The modem 92, which may be internal or external, may be connected to the system bus 41 via the user input interface 80 or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 30, or portions thereof, may be stored in a remote memory storage device. By way of example, and not limitation, FIG. 2 illustrates remote application programs 95 as residing on the remote computer 94. It will be appreciated that the network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
An exemplary embodiment of a speech recognition system 100 is illustrated in FIG. 3. The speech recognition system 100 includes the microphone 83, an analog-to-digital (A/D) converter 104, a training module 105, a feature extraction module 106, a lexicon storage module 110, an acoustic model 112 with senone trees, a tree search engine 114, the language model 16, and a general-purpose language model 111. It should be noted that the entire system 100, or part of the speech recognition system 100, can be implemented in the environment illustrated in FIG. 2. For example, the microphone 83 can serve as an input device to the computer 30 through an appropriate interface and through the A/D converter 104. The training module 105 and the feature extraction module 106 can be either hardware modules in the computer 30 or software modules stored in any of the information storage devices disclosed in FIG. 2 and accessible by the processing unit 40 or another suitable processor. In addition, the lexicon storage module 110, the acoustic model 112, and the language models 16 and 111 are also preferably stored in any of the memory devices shown in FIG. 2. Furthermore, the tree search engine 114 is implemented in the processing unit 40 (which can include one or more processors) or can be performed by a dedicated speech recognition processor employed by the computer 30.
In the illustrated embodiment, during speech recognition, speech is provided as an input into the system 100 in the form of an audible voice signal by the user speaking into the microphone 83. The microphone 83 converts the audible speech signal into an analog electronic signal, which is provided to the A/D converter 104. The A/D converter 104 converts the analog speech signal into a sequence of digital signals, which is provided to the feature extraction module 106. In one embodiment, the feature extraction module 106 is a conventional array processor that performs spectral analysis on the digital signals and computes a magnitude value for each frequency band of a frequency spectrum. The signals are, in one illustrative embodiment, provided to the feature extraction module 106 by the A/D converter 104 at a sample rate of approximately 16 kHz.
Characteristic extracting module 106 will be divided into the frame that comprises a plurality of numeral samples from the digital signal that A/D converter 104 receives.The duration of each frame approximately is 10 milliseconds.These frames are encoded into the proper vector of a plurality of frequency range spectrum signatures of reflection by characteristic extracting module 106 then.Under the situation of discrete and semicontinuous implication Markov model, characteristic extracting module 106 is also used the vector quantization technology and the encoding book that obtains from training data is encoded, and this proper vector is one or more coded words.Therefore, characteristic extracting module 106 provides its output to be used for each proper vector that what is said or talked about (or coded word).Characteristic extracting module 106 provides proper vector (or coded word) with the frequency of about per 10 milliseconds of proper vectors or (coded word).
Output probability distributions are then computed against hidden Markov models using the feature vectors (or codewords) of the particular frame being analyzed. These probability distributions are later used in executing a Viterbi decoding process or a similar type of processing technique.
After characteristic extracting module 106 receives coded word, tree-shaped search engine 114 visits are stored in the information in the acoustic model 112.This model 112 stores acoustic model, as the implication Markov model, and the speech features that its expression is detected by speech recognition system 100.In one embodiment, acoustic model 112 comprises a senone tree relevant with each Markov state in the implication Markov model.The implication Markov model is represented phoneme in an illustrated embodiment.Based on the senone in the acoustic model 112, tree-shaped search engine 114 is determined to be reached from the expression of the tongue of system user reception by the most similar phoneme of proper vector (or the code word) expression that receives from characteristic extracting module 106.
The tree search engine 114 also accesses the lexicon stored in the module 110. The information received by the tree search engine 114 based on its accessing of the acoustic model 112 is used in searching the lexicon storage module 110 to determine the word that most likely represents the codewords or feature vectors received from the feature extraction module 106. Also, the search engine 114 accesses the language models 16 and 111. In one embodiment, the language model 16 is an N-gram used to identify the most likely character or characters represented by the input speech, where the input speech comprises a character (or characters), a context marker, and a word phrase identifying the character. For example, the input speech can be "N as in Nancy", where "N" (which can also be lowercase) is the desired character, "as in" is the context marker, and "Nancy" is a word phrase associated with the character "N" to clarify or identify the desired character. For the phrase "N as in Nancy", the output of the speech recognition system 100 can be only the character "N". In other words, having analyzed the input speech data for the phrase "N as in Nancy", the speech recognition system 100 determines that the user has chosen to spell out a character. Accordingly, the context marker and the associated word phrase are omitted from the output text. The search engine 114 can delete the context marker and the associated word phrase as necessary.
It should be noted that in this embodiment the language model 111 is a word N-gram used to identify the most likely words represented by the input speech for general dictation. For example, when the speech recognition system 100 is embodied as a dictation system, the language model 111 provides an indication of the most likely words for general dictation; however, when the user utters a phrase with a context marker, the output from the language model 16 will score higher than that of the language model 111 for the same phrase. The higher score from the language model 16 is used in the system 100 as an indication that the user is identifying a character with a context marker and a word phrase. Accordingly, for an input phrase with a context marker, the search engine 114 or another processing element of the speech recognition system 100 can omit the context marker and the word phrase and output only the desired character. Use of the language model 16 is discussed further below.
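The output filtering described above, where only the spelled character survives a context-marker phrase, can be sketched in a few lines. The regular expression, function name, and restriction to the single marker "as in" are assumptions made for this illustration; the patent contemplates multiple context markers.

```python
import re

# Illustrative post-processing consistent with the described behavior: when
# the recognized phrase has the form "<char> as in <word>", only the
# character is emitted; the context marker and word phrase are dropped.

SPELLING_PHRASE = re.compile(r"^([A-Za-z]) as in (\w+)$")

def emit_text(recognized: str) -> str:
    """Return only the spelled character for a context-marker phrase;
    otherwise pass the recognized text through unchanged."""
    m = SPELLING_PHRASE.match(recognized)
    return m.group(1) if m else recognized

result = emit_text("N as in Nancy")  # only "N" reaches the output text
```

In the actual system this decision is driven by the language model 16 scoring higher than the general model 111, not by pattern matching on the output string; the regex stands in for that detection step.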
Although the speech recognition system 100 described herein uses HMM modeling and senone trees, it should be understood that this is but one illustrative embodiment. Those skilled in the art will recognize that the speech recognition system 100 can take many forms; what is required is that it use the language model 16 and provide as an output the text spoken by the user.
As is well known, a statistical N-gram language model produces a probability estimate for a word given the word sequence up to that word (i.e., given the word history H). An N-gram language model considers only the (n-1) prior words in the history H as having any influence on the probability of the next word. For example, a bigram (or 2-gram) language model considers the previous word as having an influence on the next word. Therefore, in an N-gram language model, the probability of a word occurring is represented as follows:
P(w|H) = P(w|w1, w2, … w(n-1))        (1)
where w is the word of interest;
w1 is the word located n-1 positions prior to the word w;
w2 is the word located n-2 positions prior to the word w; and
w(n-1) is the first word prior to the word w in the sequence.
Also, the probability of a word sequence is determined by multiplying the probability of each word given its history. Therefore, the probability of a word sequence (w1 … wm) is represented as follows:
P(w1 … wm) = Π(i=1 to m) P(wi|Hi)        (2)
The N-gram model is obtained by applying an N-gram algorithm to a corpus of text training data (a collection of phrases, sentences, sentence fragments, paragraphs, and the like). The N-gram algorithm may use, for example, known statistical techniques such as Katz's technique or the binomial posterior distribution back-off technique. In using these techniques, the algorithm estimates the probability that a word w(n) will follow a sequence of words w1, w2, …, w(n-1). These probability values collectively form the N-gram language model. Some aspects of the invention described below can also be applied to building a standard statistical N-gram model.
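Equations (1) and (2) can be illustrated with a toy bigram estimator trained by counting. This is a sketch under stated assumptions: raw relative-frequency counts with no smoothing (a real system would apply Katz back-off or similar), and all function names are invented for the example.

```python
from collections import Counter

# Minimal bigram (2-gram) model: P(w | prev) is estimated from counts,
# and a sequence probability is the product of the conditionals, as in
# equations (1) and (2) above. No smoothing is applied.

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sequence_prob(words, unigrams, bigrams):
    """Product over i of P(wi | w(i-1)), eq. (2) with a bigram history."""
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]  # eq. (1), n = 2
    return p

uni, bi = train_bigram(["N as in Nancy", "P as in Paul"])
```

With this tiny corpus, P("in" | "as") = 1.0 because "as" is always followed by "in", while P("Nancy" | "in") = 0.5, since "in" precedes "Nancy" and "Paul" equally often.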
A first broad aspect of the present invention is illustrated in FIG. 4 as a method 140 for creating a language model for a language processing system to indicate characters. Reference is also made to FIG. 5, which shows a system or apparatus 142 with modules having instructions for implementing the method 140. Generally, the method 140 includes, at step 144, associating, for each word phrase of a list of word phrases, a character string of the word phrase and the word phrase with a context marker indicating identification of the character string. It should be noted that the character string can comprise a single character. Likewise, a word phrase can comprise a single word. For example, for a character string equal to a single character and a word phrase equal to a single word, step 144 associates a character of the word and the word with a context marker for each word in a word list 141. A context marker is commonly a word or word phrase used by speakers of a particular language to identify a language element of a word phrase. Examples of context markers in English include "as in", "for example", "as found in", "like", "such as", and the like. Similar words or word phrases can be found in other languages, for example in Japanese and Chinese. In one embodiment, step 144 comprises building a corpus 143 of word phrases. Each phrase includes a character string, a word phrase, and a context marker. Typically, when a single character is associated with a word, the first character is used, although another character of the word can be used. Examples of such phrases include "N as in Nancy", "P as in Paul", and "Z as in zebra".
In a further embodiment, other characters of a word can be associated with the word and a context marker. In some languages, for example Chinese, where many words comprise only one, two, or three characters, it is helpful to associate each character of the word with the word in a context-marker phrase. As indicated above, a simple way to associate a desired character with its corresponding word and context marker is to form uniform phrases of this kind. Thus, given a word list 141, a corpus of word phrases 143 for training the language model, including all desired context markers, can easily be generated.
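The automatic corpus generation described above can be sketched as follows. The word list, the single "as in" marker, and the use of only the first character are assumptions chosen to match the examples in the text; the generator name itself is invented.

```python
# Hypothetical generator for the training corpus 143: for each word in a
# word list, associate the word's first character with the word via a
# context marker, forming phrases such as "N as in Nancy".

def build_corpus(word_list, context_marker="as in"):
    """Yield one spelling phrase per word: '<first char> <marker> <word>'."""
    return [f"{word[0].upper()} {context_marker} {word}" for word in word_list]

corpus = build_corpus(["Nancy", "Paul", "zebra"])
```

Per the further embodiment above, one could instead emit a phrase for every character of each word rather than only the first, which matters for languages such as Chinese where each character of a short word may need its own identifying phrase.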
From the corpus 143, the language model 16 is built using a conventional model builder 146, such as an N-gram model builder implementing known techniques for building the language model 16. Block 148 represents building the language model 16 in the method 140, where the language model 16 includes, but is not limited to, an N-gram language model, a context-free grammar, or a hybrid of the same.
The generated phrases can be assigned a suitable value so that a suitable probability value is produced when the language model is built. In the example above, "N as in Nancy" is more likely to be uttered than the phrase "N as in notch". Thus, another feature of the present invention includes adjusting the probability score of associated character strings and word phrases in the language model. The probability score can be adjusted upon creation of the language model 16. In a further embodiment, the probability score for associated characters and word phrases can be adjusted by including a sufficient number of identical phrases in the corpus 143 to produce the suitable probability value in the language model. The probability value can also be a function of the likelihood that the word phrase is used. Commonly, some word phrases are used more frequently than others to identify a character or characters. Such word phrases can be assigned, or otherwise provided with, a higher probability value in the language model.
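The duplication-based weighting described above can be sketched directly: repeating a phrase in the training corpus raises the probability a count-based model builder assigns it. The specific weights and function name below are invented for illustration.

```python
# Sketch of probability-score adjustment by duplication: a phrase judged
# more likely (e.g. "N as in Nancy") is repeated in the corpus 143 so
# that count-based training assigns it a higher probability than a rarer
# alternative such as "N as in notch".

def weighted_corpus(phrase_weights):
    """Expand {phrase: weight} into a corpus with each phrase repeated
    'weight' times, so frequent phrases dominate the counts."""
    corpus = []
    for phrase, weight in phrase_weights.items():
        corpus.extend([phrase] * weight)
    return corpus

corpus = weighted_corpus({"N as in Nancy": 3, "N as in notch": 1})
```

The alternative mentioned in the text, assigning the score directly when the model is built rather than by duplication, would achieve the same effect without inflating the corpus.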
Fig. 6 shows a speech recognition module 180 and the language model 16. The speech recognition module 180 can be of the type described above; however, it should be understood that the speech recognition module 180 is not limited to this embodiment, but can take many forms. As indicated above, the speech recognition module 180 receives data representing input speech and accesses the language model 16 to determine whether the input speech includes a phrase having a context marker. When a word phrase with a context marker is detected, the speech recognition module 180 provides as output the character or characters associated with the word phrase and context marker, rather than the context marker or the word phrase. In other words, although the speech recognition module detects the complete phrase "N as in Nancy", it provides only "N" as output. This output is particularly useful in a dictation system, where the speaker chooses to indicate individually the desired character or characters.
In this regard, it should be noted that the language model 16 described above is constructed essentially of associated characters, word phrases, and context markers, which allows the language model 16 to be sensitive to input speech having this form. In the embodiment of Fig. 3, the general language model 111 can be used for input speech that does not have the particular form of character, word phrase, and context marker. It should be understood, however, that in these embodiments the language models 16 and 111 can be merged if desired.
Upon receiving the input speech and accessing the language model 16, the speech recognition module 180 determines a recognized character and a recognized word phrase for the input speech. In many cases, the recognized character will be correct because the language model 16 was used. In a further embodiment, however, a character verification module 182 can be included to correct at least some of the errors made by the speech recognition module 180. The character verification module 182 accesses the recognized character and the recognized word phrase determined by the speech recognition module 180 and compares them, in particular verifying that the recognized character is present in the recognized word phrase. If the recognized character is not in the recognized word phrase, an error has clearly occurred, whether because the speaker dictated an incorrect phrase such as "M as in Nancy", or because the speech recognition module 180 misrecognized the recognized character or the recognized word phrase. In one embodiment, the character verification module 182 can assume that the error lies in the recognized character and therefore substitute for it a character present in the recognized word phrase. The substitution can be performed by comparing the phonetic similarity of the recognized character with the characters of the recognized word phrase. To this end, the character verification module 182 can access separately stored data pertaining to the sounds of individual characters. Using the characters present in the recognized word phrase, the character verification module 182 compares the stored phonetic data for each character in the recognized word phrase with the recognized character and provides the closest character as output. As appreciated by those skilled in the art, the character verification module 182 can be included in the speech recognition module 180; for purposes of explanation, however, the character verification module 182 is illustrated separately.
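The character verification step can be sketched as follows. The similarity measure here, overlap of spelled-out letter names, is a hypothetical stand-in for the stored acoustic data the patent describes; the `SPELLED` table and function names are assumptions.

```python
# Sketch of the character verification module 182: if the recognized
# character is absent from the recognized word phrase, substitute the
# phonetically closest character from that phrase.
SPELLED = {"m": "em", "n": "en", "a": "ay", "c": "see", "y": "why"}

def similarity(a, b):
    """Crude phonetic proxy: overlap of spelled-out letter names."""
    sa, sb = set(SPELLED.get(a, a)), set(SPELLED.get(b, b))
    return len(sa & sb) / len(sa | sb)

def verify(recognized_char, recognized_word):
    if recognized_char in recognized_word:
        return recognized_char          # consistent; no correction needed
    # Otherwise pick the phonetically closest character of the word.
    return max(recognized_word,
               key=lambda c: similarity(recognized_char, c))

print(verify("m", "nancy"))  # 'n' -- "em" and "en" are closest
```

For the erroneous phrase "M as in Nancy", the sketch corrects "m" to "n" because no "m" occurs in "nancy" and "em"/"en" sound most alike, mirroring the substitution behavior described above.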
Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (24)

1. A method of using a computer system to recognize input speech by means of a language model to indicate a character, the method comprising:
creating the language model, comprising:
constructing a training corpus, comprising:
for each word phrase in a dictionary of word phrases based on Chinese characters, automatically associating a character of the word phrase and the word phrase with a context marker indicating identification of the character, to generate context-marker phrases for the training corpus; and
creating said language model using the training corpus;
recognizing a spoken character, comprising:
receiving input speech having a character, wherein the input speech includes a context-marker phrase comprising the character, the context-marker phrase being based on Chinese characters and having a word phrase based on Chinese characters and a context marker, wherein said context marker indicates elimination of ambiguity of the character;
detecting, without a prompt, the context-marker phrase in the received input speech;
executing instructions to access said language model, wherein said language model comprises an N-gram language model having probability information for the generated context-marker phrases; and
outputting the character as text, without the word phrase and the context marker of the detected context-marker phrase.
2. The method of claim 1, wherein the language model comprises a context-free grammar.
3. The method of claim 1, wherein associating comprises associating a first character of each word phrase with that word phrase.
4. The method of claim 1, wherein associating comprises associating a character other than the first character of at least a portion of the word phrases with the corresponding word phrase.
5. The method of claim 1, wherein associating comprises associating each character of at least a portion of the word phrases with the corresponding word phrase.
6. The method of claim 1, wherein associating comprises associating each character of each word phrase with the corresponding word phrase.
7. The method of claim 1, further comprising adjusting a probability score for each associated character and word phrase in the language model.
8. The method of claim 10, wherein the context marker comprises " " in Japanese.
9. The method of claim 1, wherein the context marker comprises " " in Chinese.
10. The method of claim 1, wherein each word phrase is a word comprising at least one character.
11. The method of claim 1, wherein outputting the character comprises outputting the character as a function of probabilities stored in the language model.
12. The method of claim 11, wherein outputting the character comprises outputting the character as a function of N-gram probabilities for the received input speech.
13. The method of claim 11, wherein outputting the character comprises outputting the character solely as a function of the received input speech.
14. The method of claim 11, wherein, when the recognized character is not present in the recognized word phrase, the character that is output is a character of the recognized word phrase.
15. A computer system for recognizing input speech, comprising:
a language model indicating context-marker phrases and probability information for said context-marker phrases, wherein said context-marker phrases consist essentially of an associated character string based on Chinese characters, a word phrase, and a context marker associated with the character string; and
a recognition module for receiving data indicative of input speech, said recognition module detecting, without a prompt indicating the character string as text, the presence of a context-marker phrase in the received input speech; accessing said language model; and, in accordance with the probability information in the language model, outputting as text the character string, based on Chinese characters, of at least some of the context-marker phrases spoken by the user.
16. The computer system of claim 15, wherein said recognition module processes a detected context-marker phrase differently from other input speech by outputting only the character string of the detected context-marker phrase.
17. The computer system of claim 15, wherein said language model comprises a statistical language model.
18. The computer system of claim 15, wherein said language model comprises an N-gram language model.
19. The computer system of claim 15, wherein said language model comprises a context-free language model.
20. The computer system of claim 15, wherein said language model outputs the character string as a function of a comparison of the recognized character string with the recognized word phrase.
21. The computer system of claim 20, wherein, when the recognized character string is not present in the recognized word phrase, the character string that is output is a character string of the recognized word phrase.
22. The computer system of claim 15, wherein each said word phrase is a word.
23. The computer system of claim 21, wherein each said character string is a single character.
24. The computer system of claim 15, wherein each said character string is a single character.
CNB021065306A 2001-01-31 2002-01-29 Divergence elimination language model Expired - Fee Related CN100568222C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/773,342 US6507453B2 (en) 2000-02-02 2001-01-31 Thin floppy disk drive capable of preventing an eject lever from erroneously operating
US09/773,342 2001-01-31

Publications (2)

Publication Number Publication Date
CN1369830A CN1369830A (en) 2002-09-18
CN100568222C true CN100568222C (en) 2009-12-09

Family

ID=25097940

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021065306A Expired - Fee Related CN100568222C (en) 2001-01-31 2002-01-29 Divergence elimination language model

Country Status (1)

Country Link
CN (1) CN100568222C (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1727024A1 (en) * 2005-05-27 2006-11-29 Sony Ericsson Mobile Communications AB Automatic language selection for text input in messaging context
CN1940915B (en) * 2005-09-29 2010-05-05 International Business Machines Corp Corpus expansion system and method
CN101256624B (en) * 2007-02-28 2012-10-10 Microsoft Corp Method and system for constructing an HMM topology suitable for recognizing handwritten East Asian characters
CN105045777A (en) * 2007-08-01 2015-11-11 金格软件有限公司 Automatic context sensitive language correction and enhancement using an internet corpus
US8326333B2 (en) 2009-11-11 2012-12-04 Sony Ericsson Mobile Communications Ab Electronic device and method of controlling the electronic device
CN105340003B (en) * 2013-06-20 2019-04-05 Toshiba Corp Speech synthesis dictionary creation apparatus and speech synthesis dictionary creation method
CN103943109A (en) * 2014-04-28 2014-07-23 Shenzhen Ruguo Technology Co., Ltd. Method and device for converting voice to characters
JP2016024212A (en) * 2014-07-16 2016-02-08 ソニー株式会社 Information processing device, information processing method and program
US10229687B2 (en) * 2016-03-10 2019-03-12 Microsoft Technology Licensing, Llc Scalable endpoint-dependent natural language understanding
CN113034995B (en) * 2021-04-26 2023-04-11 Readboy Education Technology Co., Ltd. Method and system for generating dictation content with a student tablet

Also Published As

Publication number Publication date
CN1369830A (en) 2002-09-18

Similar Documents

Publication Publication Date Title
CN100568223C (en) The method and apparatus that is used for the multi-mode input of ideographic language
CN1667700B (en) Method for adding voice or acoustic description, pronunciation in voice recognition dictionary
CN1667699B (en) Generating large units of graphonemes with mutual information criterion for letter to sound conversion
CN109313896B (en) Extensible dynamic class language modeling method, system for generating an utterance transcription, computer-readable medium
US7251600B2 (en) Disambiguation language model
US6327566B1 (en) Method and apparatus for correcting misinterpreted voice commands in a speech recognition system
JP4267385B2 (en) Statistical language model generation device, speech recognition device, statistical language model generation method, speech recognition method, and program
US6587818B2 (en) System and method for resolving decoding ambiguity via dialog
KR101153078B1 (en) Hidden conditional random field models for phonetic classification and speech recognition
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
CN1645477B (en) Automatic speech recognition learning using user corrections
EP0965978B1 (en) Non-interactive enrollment in speech recognition
US8065149B2 (en) Unsupervised lexicon acquisition from speech and text
KR101590724B1 (en) Method for modifying error of speech recognition and apparatus for performing the method
CN1779783B (en) Generic spelling mnemonics
JP2559998B2 (en) Speech recognition apparatus and label generation method
US6876967B2 (en) Speech complementing apparatus, method and recording medium
CN100589180C (en) Method of speech recognition using multimodal variational inference with switching state space models
WO2001093246A2 (en) Creating a unified task dependent language models with information retrieval techniques
US20090220926A1 (en) System and Method for Correcting Speech
CN100568222C (en) Divergence elimination language model
US6631348B1 (en) Dynamic speech recognition pattern switching for enhanced speech recognition accuracy
US20020040296A1 (en) Phoneme assigning method
JP4499389B2 (en) Method and apparatus for generating decision tree questions for speech processing
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091209

Termination date: 20150129

EXPY Termination of patent right or utility model