CN1387651A - System and iterative method for lexicon, segmentation and language model joint optimization - Google Patents



Publication number
CN1387651A
CN1387651A (application CN00815294A)
Authority
CN
China
Prior art keywords
segmentation
language model
dictionary
corpus
accordance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN00815294A
Other languages
Chinese (zh)
Other versions
CN100430929C (en)
Inventor
王海峰
黄常宁
李凯夫
狄硕
蔡东峰
秦立峰
郭建峰
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date
Filing date
Publication date
Application filed by Microsoft Corp
Publication of CN1387651A
Application granted
Publication of CN100430929C
Anticipated expiration
Legal status: Expired - Fee Related

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/253 — Grammatical analysis; Style critique
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/08 — Speech classification or search
    • G10L 15/18 — Speech classification or search using natural language modelling
    • G10L 15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 — Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 — Probabilistic grammars, e.g. word n-grams


Abstract

A method for optimizing a language model is presented comprising developing an initial language model from a lexicon and segmentation derived from a received corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved.

Description

System and iterative method for lexicon, segmentation and language model joint optimization
This application claims priority to provisional patent application No. 60/163,850, "An iterative method for lexicon, word segmentation and language model joint optimization," filed November 5, 1999 by the present inventors.
Technical field
The present invention relates to language modeling and, more particularly, to a system and iterative method for joint optimization of a lexicon, word segmentation and a language model.
Background
Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications, including web browsers, word processing and speech recognition applications. The latest generation of web browsers, for example, anticipates a uniform resource locator (URL) address after the user has entered only the first two or three characters of a domain name. Word processors offer improved spelling and grammar checking, word prediction, and language conversion. Newer speech recognition applications likewise offer impressive recognition and prediction accuracy. To be useful to the end user, these features must be performed substantially in real time. To provide this level of performance, many applications rely on a tree data structure to build a simple language model.
Briefly, a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of that sequence. A common approach to building a language model is to construct an N-gram language model with a prefix tree data structure from a training set drawn from a known textual corpus.
The use of a prefix tree data structure (also known as a suffix tree, or PAT tree) enables a higher-level application to traverse the language model quickly, providing the substantially real-time performance characteristics described above. Briefly, an N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in strings of size N throughout a text. The counts are then used to calculate the probability of use of a string. A typical tri-gram (N-gram with N=3) approach comprises the steps of:
(a) dividing the text corpus into items (characters, letters, numbers, etc.);
(b) segmenting the items (e.g., characters (C)) into strings (e.g., words (W)) according to a small predetermined lexicon and a simple predetermined segmentation algorithm, wherein each W maps to one or more C in the tree data structure; and
(c) training the language model from the segmented corpus by counting occurrences of strings, thereby predicting the probability of a series of words (W_1, W_2, ..., W_M) from the preceding two words:
P(W_1, W_2, W_3, ..., W_M) ≈ ∏_i P(W_i | W_{i-1}, W_{i-2})    (1)
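For reference, the conventional tri-gram approach of steps (a)-(c) can be sketched as follows. This is a minimal illustration of the prior-art baseline, not the patent's method; the function names and the unsmoothed maximum-likelihood estimate are assumptions.

```python
from collections import defaultdict

def train_trigram(words):
    """Step (c): count trigrams and their bigram prefixes over a
    pre-segmented corpus (a list of words)."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    padded = ["<s>", "<s>"] + list(words)  # pad so the first word has a context
    for i in range(2, len(padded)):
        bi[(padded[i-2], padded[i-1])] += 1
        tri[(padded[i-2], padded[i-1], padded[i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w2, w1, w):
    """P(w | w2 w1) by maximum likelihood; no smoothing in this sketch,
    so unseen contexts simply return 0."""
    denom = bi[(w2, w1)]
    return tri[(w2, w1, w)] / denom if denom else 0.0
```

Note that any error in the predetermined segmentation feeding `train_trigram` is baked into the counts, which is exactly the propagation problem the background section describes.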
N-gram language models are limited in several respects. First, the counting process used to construct the prefix tree is very time-consuming, so that in practice only small N-gram models (typically 2-gram or 3-gram) can be realized. Second, as the string length (N) of the N-gram language model increases, the memory required to store the prefix tree grows as 2^N. Thus, for N-grams larger than three (i.e., beyond 3-gram), the memory required to store the N-gram language model, and the access time required to use it, become prohibitively large.
Prior-art N-gram language models tend to use a fixed (and small) lexicon and an overly simplistic segmentation algorithm, and typically rely on only the two preceding words to predict the current word (in the case of a 3-gram model).
A fixed lexicon limits the model's ability to select the words best suited to a general or specialized task. If a word is not in the lexicon then, as far as the associated model is concerned, the word does not exist. A small lexicon therefore cannot cover the expected language content.
Segmentation algorithms are typically ad hoc, and are not based on any statistical or semantic principles. A common failing of overly simple segmentation algorithms is to discard smaller words in favor of larger ones. As a result, such models cannot accurately predict smaller words contained within semantically acceptable larger strings.
As a result of the foregoing limitations, language models built with prior-art lexicons and segmentation algorithms tend to be error-prone. That is, any error introduced at the lexicon or segmentation stage propagates throughout the entire language model, limiting the accuracy and predictive properties of the model.
Finally, confining the model's context to at most the two preceding words (in the case of a 3-gram language model) is likewise restrictive, since accurately predicting the likelihood of a word may require more context. These three limitations together typically result in poor predictive quality.
Thus, a system and method for joint optimization of a lexicon, segmentation algorithm and language model is needed that is unencumbered by the deficiencies and limitations commonly associated with prior-art language modeling techniques. Just such a solution is provided below.
Summary of the invention
The present invention relates to a system and iterative method for joint optimization of a lexicon, segmentation and language model. To overcome the limitations of the prior art, the present invention does not rely on a predetermined lexicon or segmentation algorithm; rather, the lexicon and segmentation are generated dynamically within an iterative process of optimizing the language model. According to one implementation, a method for improving language model performance is provided, comprising developing an initial language model from a lexicon and segmentation derived from a received text corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the text corpus according to statistical principles, until a threshold of predictive capability is achieved.
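The maximum match technique used to derive the initial segmentation can be sketched as a greedy longest-first scan. This is a hedged illustration only: the lexicon contents, the maximum word length, and the fallback to single items are assumptions not specified in the text.

```python
def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    string present in the lexicon, falling back to a single item
    when no lexicon entry matches."""
    i, words = 0, []
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i+n] in lexicon:
                words.append(text[i:i+n])
                i += n
                break
    return words
```

In the method summarized above, the output of such a pass would seed the initial language model, after which the lexicon and segmentation are refined statistically rather than fixed.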
Description of drawings
The same reference numbers are used throughout the drawings to reference like components and features.
Fig. 1 is a block diagram of a computer system incorporating the teachings of the present invention;
Fig. 2 is a block diagram of an example modeling agent that iteratively develops the lexicon, segmentation and language model according to the present invention;
Fig. 3 is a graphical representation of a DOMM tree according to one aspect of the present invention;
Fig. 4 is a flow chart of an example method for building a DOMM tree;
Fig. 5 is a flow chart of an example method for joint optimization of the lexicon, segmentation and language model according to the teachings of the present invention;
Fig. 6 is a flow chart detailing the method steps, according to one implementation of the present invention, for generating an initial lexicon and then iteratively and dynamically generating the lexicon, segmentation and language model until convergence; and
Fig. 7 illustrates a storage medium having stored thereon a plurality of executable instructions which, when executed, implement the innovative modeling agent of the present invention, according to an alternative embodiment.
Detailed description
The present invention relates to a system and iterative method for joint optimization of a lexicon, segmentation and language model. In describing the present invention, reference is made to an innovative language model, the dynamic order Markov model (DOMM). A detailed description of the DOMM is provided in co-pending U.S. Patent Application No. 09/XXXXXX, "A Method and Apparatus for Generating and Managing a Language Model Data Structure," by Lee et al., the disclosure of which is incorporated herein by reference.
In the discussion herein, the present invention is described in the general context of computer-executable instructions, such as program modules, being executed by one or more conventional computers. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, personal digital assistants, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. In a distributed computing environment, program modules may be located in both local and remote memory storage devices. It is to be noted, however, that modifications to the architecture and methods described herein may well be made without deviating from the spirit and scope of the present invention.
Example computer system
Fig. 1 illustrates an example computer system 102 including an innovative language modeling agent 104 which jointly optimizes the lexicon, segmentation and language model in accordance with the teachings of the present invention. It will be appreciated that, although depicted as a separate application in Fig. 1, language modeling agent 104 may well be implemented as a function of a higher-level application, e.g., a word processor, web browser, speech recognition system, or the like. Moreover, although depicted as a software application, those skilled in the art will appreciate that the innovative modeling agent may also be implemented in hardware, e.g., a programmable logic array (PLA), a special-purpose processor, an application-specific integrated circuit (ASIC), a microcontroller, and the like.
As will be evident from the description to follow, computer 102 is intended to represent any of a class of general- or special-purpose computing platforms which, when endowed with the innovative language modeling agent (LMA) 104, implement the teachings of the present invention in accordance with the first example implementation introduced above. It will be appreciated that, although LMA 104 is described herein as a software application, computer system 102 may alternatively support a hardware implementation of LMA 104. Accordingly, but for the description of LMA 104, the following description of computer system 102 is intended to be merely illustrative, as computer systems of greater or lesser capability may well be substituted without deviating from the spirit and scope of the present invention.
As shown, computer 102 includes one or more processors 132, a system memory 134, and a bus 136 that couples various system components, including the system memory 134, to processor 132.
Bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. The system memory includes read-only memory (ROM) 138 and random access memory (RAM) 140. A basic input/output system (BIOS) 142, containing the basic routines that help to transfer information between elements within computer 102, such as during start-up, is stored in ROM 138. Computer 102 further includes a hard disk drive 144 for reading from and writing to a hard disk (not shown), a magnetic disk drive 146 for reading from and writing to a removable magnetic disk 148, and an optical disk drive 150 for reading from and writing to a removable optical disk 152 such as a CD-ROM, DVD-ROM or other optical medium. The hard disk drive 144, magnetic disk drive 146 and optical disk drive 150 are connected to bus 136 by a SCSI interface 154 or some other suitable bus interface. These drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for computer 102.
Although the example environment described herein employs a hard disk 144, a removable magnetic disk 148 and a removable optical disk 152, those skilled in the art will appreciate that other types of computer-readable media which can store data accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAM), read-only memories (ROM), and the like, may also be used in the example operating environment.
A number of program modules may be stored on the hard disk 144, magnetic disk 148, optical disk 152, ROM 138 or RAM 140, including an operating system 158, one or more application programs 160 including the innovative LMA 104 incorporating the teachings of the present invention, other program modules 162, and program data 164 (e.g., the resulting language model data structure, etc.). A user may enter commands and information into computer 102 through input devices such as a keyboard 166 and a pointing device 168. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are connected to processor 132 through an interface 170 that is coupled to bus 136. A monitor 172 or other type of display device is also connected to bus 136 via an interface, such as a video adapter 174. In addition to the monitor 172, personal computers typically include other peripheral output devices (not shown) such as speakers and printers.
As shown, computer 102 operates in a networked environment using logical connections to one or more remote computers, such as a remote computer 176. Remote computer 176 may be another personal computer, a personal digital assistant, a server, a router or other network device, a network "thin-client" PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer 102, although only a memory storage device 178 is illustrated in Fig. 1.
The logical connections depicted in Fig. 1 include a local area network (LAN) 180 and a wide area network (WAN) 182. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. In one embodiment, remote computer 176 executes an Internet Web browser program, such as the "Internet Explorer" Web browser manufactured and distributed by Microsoft Corporation of Redmond, Washington, to access and utilize online services.
When used in a LAN networking environment, computer 102 is connected to the local network 180 through a network interface or adapter 184. When used in a WAN networking environment, computer 102 typically includes a modem 186 or other means for establishing communications over the wide area network 182, such as the Internet. The modem 186, which may be internal or external, is connected to bus 136 via an input/output (I/O) interface 156. In addition to network connectivity, I/O interface 156 also supports one or more printers 188. In a networked environment, program modules depicted relative to personal computer 102, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are illustrative, and other means of establishing a communications link between the computers may be used.
Generally, the data processors of computer 102 are programmed by means of instructions stored at different times in the various computer-readable storage media of the computer. Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs. From there, they are installed or loaded into the secondary memory of the computer. At execution, they are loaded at least partially into the computer's primary electronic memory. The invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing, in conjunction with a microprocessor or other data processor, the innovative steps described below. The invention also includes the computer itself when programmed according to the methods and techniques described below. Furthermore, certain subcomponents of the computer may be programmed to perform the functions and steps described below; the invention includes such subcomponents when they are programmed as described. In addition, the invention described herein includes data structures, described below, as embodied on various types of storage media.
For purposes of illustration, programs and other executable program components, such as the operating system, are depicted herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer, and are executed by the computer's data processor(s).
Example language modeling agent
Fig. 2 illustrates a block diagram of an example language modeling agent (LMA) 104 incorporating the teachings of the present invention. As shown, LMA 104 is comprised of one or more controllers 202, an innovative analysis engine 204, storage/memory 206 and, optionally, one or more additional helper applications 208 (e.g., a graphical user interface, prediction applications, verification applications, estimation applications, etc.), each communicatively coupled as shown. It will be appreciated that, although depicted in Fig. 2 as a number of disparate blocks, one or more of the functional elements of LMA 104 may well be combined. In this regard, modeling agents of greater or lesser complexity that iteratively and jointly optimize a dynamic lexicon, segmentation and language model may be employed without deviating from the spirit and scope of the present invention.
As alluded to above, although depicted as a separate functional element, LMA 104 may well be implemented as a function of a higher-level application, e.g., a word processor, web browser, speech recognition system, or language conversion system. In such an implementation, controller(s) 202 of LMA 104 are responsive to one or more directive commands from the parent application to selectively invoke the features of LMA 104. Alternatively, LMA 104 may well be implemented as a stand-alone language modeling tool, providing the user with a user interface (208) to selectively implement the features of LMA 104 described below.
In either case, controller(s) 202 of LMA 104 selectively invoke one or more functions of analysis engine 204 to optimize a language model based on a dynamically generated lexicon and segmentation algorithm. Thus, except as configured to effect the teachings of the present invention, controller 202 is intended to represent any of a number of alternate control systems known in the art, including but not limited to a microprocessor, a programmable logic array (PLA), a microcomputer, an application-specific integrated circuit (ASIC), and the like. In an alternate implementation, controller 202 is intended to represent a series of executable instructions to implement the control logic described above.
As shown, the innovative analysis engine 204 is comprised of a Markov probability calculator 212, a data structure generator 210 including a frequency calculation subroutine 213, a dynamic lexicon generation subroutine 214, a dynamic segmentation subroutine 216, and a data structure memory manager 218. Upon receiving an external indication, controller 202 selectively invokes an instance of analysis engine 204 to develop, modify and optimize a statistical language model (SLM). More particularly, and unlike prior-art language modeling techniques, analysis engine 204 generates the statistical language model data structure substantially from the Markov transition probabilities between individual items (e.g., characters, letters, numbers, etc.) of a textual corpus (e.g., one or more sets of text). Moreover, as will be shown, analysis engine 204 utilizes as much data (referred to as "context", or "order") as possible in calculating the probability of a string of items. In this regard, the language model of the present invention is aptly referred to as a dynamic order Markov model (DOMM).
When invoked by controller 202 to build the DOMM data structure, analysis engine 204 selectively invokes data structure generator 210. In response, data structure generator 210 builds a tree data structure comprised of a number of nodes (one associated with each of a number of items) and representing the dependencies between the nodes. As introduced above, the tree data structure is referred to herein as the DOMM data structure, or DOMM tree. Controller 202 receives a textual corpus and stores at least a subset of the corpus in memory 206 as a dynamic training set 222, from which the language model will be generated. It will be appreciated that, in alternate embodiments, a predetermined training set may also be used.
Upon receipt of the dynamic training set, frequency calculation subroutine 213 retrieves at least a subset of training set 222 for analysis. Frequency calculation subroutine 213 identifies the frequency of occurrence of each item (character, letter, number, word, etc.) within the training set subset. Based on the dependencies between nodes, data structure generator 210 assigns each item to an appropriate node of the DOMM tree, along with an indication of its frequency value (C_i) and a comparison bit (b_i).
Markov probability calculator 212 calculates the probability of an item (character, letter, number, etc.) based on the context (j) of associated items. More specifically, in accordance with the teachings of the present invention, the Markov probability of a particular item (C_i) depends on as many prior characters as the data "allows", i.e.:
P(C_1, C_2, C_3, ..., C_N) ≈ ∏_i P(C_i | C_{i-1}, C_{i-2}, C_{i-3}, ..., C_j)    (2)
The number of characters used by Markov probability calculator 212 as context (j) varies "dynamically" for each sequence of characters C_i, C_{i-1}, C_{i-2}, C_{i-3}, etc. According to one implementation, the number of context characters (j) used by Markov probability calculator 212 depends, at least in part, on the frequency values of the individual characters, i.e., the rate at which they occur throughout the text corpus. More specifically, if Markov probability calculator 212 does not identify at least a minimum frequency of occurrence for a particular item within the text corpus, the item may be pruned (i.e., eliminated) from the tree data structure as statistically irrelevant. According to one embodiment, the minimum frequency threshold is three (3).
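The dynamic-order idea — use as long a context as the counts support, pruning contexts below the minimum frequency threshold — can be sketched as follows. This is a simplified illustration under stated assumptions: the longest-first backoff rule and the maximum-likelihood estimator are not specified in the text; only the threshold of three is taken from it.

```python
from collections import defaultdict

MIN_COUNT = 3  # minimum frequency threshold from the text; rarer contexts are pruned

def count_ngrams(text, max_order):
    """Count all substrings of length 1..max_order+1 (item plus context)."""
    counts = defaultdict(int)
    for i in range(len(text)):
        for j in range(max_order + 1):
            if i - j < 0:
                break
            counts[text[i-j:i+1]] += 1
    return counts

def dynamic_prob(counts, history, c):
    """P(c | history) using the longest suffix of the history whose
    count clears MIN_COUNT; otherwise fall back to the unigram estimate."""
    for k in range(len(history), -1, -1):
        ctx = history[len(history) - k:]
        if counts.get(ctx, 0) >= MIN_COUNT and counts.get(ctx + c, 0) > 0:
            return counts[ctx + c] / counts[ctx]
    total = sum(v for s, v in counts.items() if len(s) == 1)
    return counts.get(c, 0) / total if total else 0.0
```

The effect is that frequent sequences are predicted from long contexts while rare ones degrade gracefully, rather than fixing the order at two preceding items as a tri-gram model does.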
As alluded to above, analysis engine 204 does not rely on a fixed lexicon or a simple segmentation algorithm, both of which are prone to error. Rather, analysis engine 204 selectively invokes dynamic segmentation subroutine 216 to group items (e.g., characters or letters) into strings (e.g., words). More specifically, segmentation subroutine 216 divides training set 222 into subsets (chunks) and calculates an intra-subset cohesion measure, i.e., a measure of the similarity among the items within the subset. Segmentation subroutine 216 iteratively performs the segmentation and cohesion calculations until the intra-subset cohesion of each subset reaches a predetermined threshold.
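One way to realize an intra-chunk cohesion measure is pointwise mutual information (PMI) over adjacent items. The sketch below is an assumption — the text does not name PMI — and shows a single greedy merge pass; in the iterative scheme described above, such passes would repeat until the cohesion values stabilize against the threshold.

```python
import math
from collections import Counter

def pmi(pair, uni, bi, total):
    """Pointwise mutual information of an adjacent item pair,
    a stand-in for the unspecified cohesion measure."""
    a, b = pair
    if bi[pair] == 0:
        return float("-inf")
    return math.log(bi[pair] * total / (uni[a] * uni[b]))

def merge_pass(tokens, threshold):
    """One pass: greedily fuse adjacent tokens whose PMI exceeds threshold."""
    total = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pmi((tokens[i], tokens[i+1]), uni, bi, total) > threshold:
            out.append(tokens[i] + tokens[i+1])  # cohesive pair becomes one string
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```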
Lexicon generation subroutine 214 is invoked to dynamically generate a lexicon 220 and store it in memory 206. According to one implementation, lexicon generation subroutine 214 analyzes the segmentation results and generates the lexicon from strings whose Markov transition probabilities exceed a threshold. In this regard, lexicon generation subroutine 214 develops the dynamic lexicon 220 from strings exceeding a predetermined Markov transition probability, derived from one or more language models generated by analysis engine 204. Consequently, unlike prior-art language models that depend on an error-prone, fixed lexicon, analysis engine 204 generates a lexicon of statistically significant, statistically accurate strings from one or more language models developed over time. According to one embodiment, the lexicon 220 comprises a "virtual corpus" that Markov probability calculator 212 relies on, in addition to the dynamic training set, in developing subsequent language models.
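A minimal sketch of admitting strings into the lexicon by transition probability follows. The threshold value, candidate set, and the brute-force substring counting are illustrative assumptions, not the patent's procedure.

```python
def extract_lexicon(raw_text, candidates, min_prob=0.5):
    """Admit a candidate string when the probability of its final item
    given its prefix, estimated from raw corpus counts, clears min_prob."""
    def occurrences(s):
        # overlapping substring count over the corpus
        return sum(1 for i in range(len(raw_text) - len(s) + 1)
                   if raw_text[i:i+len(s)] == s)
    lexicon = set()
    for w in candidates:
        prefix_hits = occurrences(w[:-1])
        if prefix_hits and occurrences(w) / prefix_hits >= min_prob:
            lexicon.add(w)
    return lexicon
```

Because the counts come from the model's own segmentation history rather than a hand-built word list, the admitted strings reflect the corpus statistics, which is the point of the "virtual corpus" described above.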
When invoked to modify or utilize the DOMM language model data structure, analysis engine 204 selectively invokes an instance of data structure memory manager 218. In accordance with one aspect of the present invention, data structure memory manager 218 utilizes both system memory and extended memory to store the DOMM data structure. More specifically, as described more fully below with reference to Figs. 6 and 7, data structure memory manager 218 provides improved performance characteristics by employing a WriteNode subroutine and a ReadNode subroutine (not shown) to store a most recently used subset of DOMM data structure nodes in a first-level cache 224 of system memory 206, while moving less recently used nodes to extended memory (e.g., a disk file on hard drive 144 or some remote drive). In addition, a second-level cache of system memory 206 is used to aggregate write commands until a predetermined threshold is reached, at which point the aggregated WriteNode commands are issued by the data structure memory manager to the appropriate location in memory. Although depicted as a separate functional element, those skilled in the art will appreciate that data structure memory manager 218 may well be combined as a functional element of controller 202 without deviating from the spirit and scope of the present invention.
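The two-level scheme — recently used nodes resident in memory, less recently used nodes spilled to disk via ReadNode/WriteNode operations — might be sketched as a least-recently-used write-back cache. The class and field names are assumptions, and the second-level write-aggregation cache described above is omitted for brevity.

```python
from collections import OrderedDict

class NodeCache:
    """LRU write-back cache for tree nodes: recently used nodes stay in
    memory; the least recently used node is spilled to backing storage
    when capacity is exceeded."""
    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.backing = backing  # e.g. a dict standing in for a disk file

    def read_node(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        return self.backing.get(key)     # fall through to extended memory

    def write_node(self, key, node):
        self.cache[key] = node
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            old_key, old_node = self.cache.popitem(last=False)
            self.backing[old_key] = old_node  # spill LRU node to disk
```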
Example data structure — dynamic order Markov model (DOMM) tree
Fig. 3 graphically illustrates an example dynamic order Markov model tree data structure 300 in accordance with the teachings of the present invention. To illustrate the principles on which DOMM tree data structure 300 is constructed, Fig. 3 presents an example DOMM data structure 300 for a language model formed from the English alphabet, i.e., A, B, C, ..., Z. As shown, DOMM tree 300 is comprised of one or more root nodes 302 and one or more dependent nodes 304, each associated with an item (character, letter, number, word, etc.) of a textual corpus, and joined by logical connections representing the dependencies between the nodes. According to one implementation of the present invention, a root node 302 is comprised of an item and a frequency value (e.g., a count of how many times the item appears in the textual corpus). At a level below the root node layer 302, the dependent nodes are arranged in binary sub-trees, wherein each node includes a comparison bit (b_i), the item associated with the node (A, B, ...), and the frequency value of that item (C_N).
Thus, beginning with the root node 306 associated with item B, a binary sub-tree comprised of dependent nodes 308-318 denotes the relationships between the nodes and their frequencies of occurrence. Given this illustrative example, it will be appreciated that the search complexity of the DOMM tree, beginning from a root node such as node 306, approaches log(N), where N is the total number of nodes to be searched.
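The binary sibling sub-tree and its near-log(N) lookup can be sketched as follows. The field names and the ordering by item value are assumptions for illustration; the structure described above keys on comparison bits (b_i) rather than direct item comparison.

```python
class DommNode:
    """One node: an item, its occurrence count, and binary sub-tree links
    among sibling nodes (a simplified stand-in for the DOMM tree node)."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.left = None
        self.right = None

def insert(node, item):
    """Binary insert into a sibling sub-tree, bumping the count on a hit;
    roughly log(N) comparisons on a balanced sub-tree."""
    if node is None:
        node = DommNode(item)
    if item == node.item:
        node.count += 1
    elif item < node.item:
        node.left = insert(node.left, item)
    else:
        node.right = insert(node.right, item)
    return node

def find(node, item):
    """Descend the binary sub-tree; returns the node or None."""
    while node is not None and node.item != item:
        node = node.left if item < node.item else node.right
    return node
```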
As alluded to above, the size of DOMM tree 300 may well exceed the space available in main memory 206 of LMA 104 and/or computer system 102. Accordingly, data structure memory manager 218 facilitates maintaining the DOMM tree data structure 300 across main memory (e.g., 140 and/or 260) and extended memory, e.g., a disk file on a mass storage device such as hard disk drive 144 of computer system 102.
Example operation and implementation
Having introduced the functional and conceptual elements of the present invention with reference to Figs. 1-3, the operation of the innovative language modeling agent 104 will now be described with reference to Figs. 5-10.
Building the DOMM tree data structure
Fig. 4 is a flow chart of an example method for building a dynamic order Markov model (DOMM), in accordance with one aspect of the present invention. As alluded to above, language modeling agent 104 may be invoked directly by a user or by a higher-level application. In response, controller 202 of LMA 104 selectively invokes an instance of analysis engine 204, and a textual corpus (e.g., one or more documents) is loaded into memory 206 as a dynamic training set 222 and split into subsets (e.g., sentences, clauses, etc.), block 402. In response, data structure generator 210 assigns each item of the subset to a node of the data structure and calculates a frequency value for the item, block 404. According to one implementation, once data structure generator has populated the data structure with the subset, frequency calculation subroutine 213 is invoked to determine the frequency of occurrence of each item in the training set subset.
In block 406, data structure generator determines whether additional subsets of the training set remain; if so, the next subset is read at block 408 and the process continues at block 404. In an alternative implementation, data structure generator 210 populates the data structure one subset at a time before invoking frequency calculation subroutine 213. In yet another alternative, the frequency calculation subroutine simply counts each item as it is placed into the associated node of the data structure.
If, in block 406, data structure generator 210 has added every item of training set 222 to data structure 300, data structure generator 210 selectively prunes the data structure, block 410. Any of a number of mechanisms may be employed to prune the resultant data structure 300.
Example method for lexicon, segmentation and language model joint optimization
Fig. 5 is a flow chart of an example method for the joint optimization of lexicon, segmentation and language model, in accordance with the teachings of the present invention. As shown, the method begins with block 400, wherein LMA 104 is invoked and a prefix tree is built from at least a subset of a received textual corpus. More particularly, as shown in Fig. 4, data structure generator 210 of modeling agent 104 analyzes the received textual corpus, selects at least a subset as a training set, and builds a DOMM tree from the training set.
In block 502, a very large lexicon is built from the prefix tree and pre-processed to remove certain obviously illogical words. More particularly, lexicon generation subroutine 214 is invoked to build an initial lexicon from the prefix tree. According to one implementation, the initial lexicon is built from all substrings of the prefix tree whose length is less than a certain predetermined value, e.g., ten (10) items (i.e., paths from the root node to a subordinate node spanning ten nodes or fewer). Once the initial lexicon has been compiled, lexicon generation subroutine 214 reduces its size by deleting certain obviously illogical words (see, e.g., block 604 below). According to one implementation, lexicon generation subroutine 214 appends the newly generated initial lexicon, trained from at least the received textual corpus, to a predetermined lexicon.
In block 504, at least the training set of the received textual corpus is segmented using the initial lexicon. More particularly, dynamic segmentation subroutine 216 is invoked to segment at least the training set of the received textual corpus, producing an initially segmented textual corpus. Those skilled in the art will appreciate that any of a number of methods may be used to segment the training corpus, e.g., fixed-length segmentation, maximum matching, and the like. Since no statistical language model (SLM) has yet been generated from the received textual corpus, dynamic segmentation subroutine 216 utilizes the maximum matching technique to provide the initially segmented textual corpus. Accordingly, segmentation subroutine 216 starts at the beginning of a string (or a branch of the DOMM tree) and searches the lexicon to check whether the initial item (I_1) is a one-item "word". The segmentation subroutine then combines that item with the next item in the string and searches the lexicon to determine whether the combination (e.g., I_1 I_2) appears as a "word", and so on. According to one implementation, the longest string of items (I_1, I_2, ..., I_N) found in the lexicon is taken as the correct segmentation of the string. It will be appreciated that segmentation subroutine 216 may well utilize more sophisticated maximum matching algorithms without deviating from the spirit and scope of the present invention.
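The forward maximum-matching pass just described can be sketched as follows. This is a simplified illustration with hypothetical names; the patent's subroutine operates over lexicon entries and DOMM-tree branches rather than Python strings, and the ten-item limit mirrors the substring bound cited above.

```python
def max_match(text, lexicon, max_len=10):
    """Greedy forward maximum matching: at each position take the
    longest substring found in the lexicon; an unmatched single item
    falls back to a one-item 'word'."""
    segments, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one item.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                segments.append(text[i:j])
                i = j
                break
    return segments

# Usage: "ABC" is in the lexicon, so it wins over "AB"; "D" falls back
# to a one-item word.
lexicon = {"AB", "ABC", "CD"}
```

More sophisticated variants (e.g., backward matching, or resolving ties by word count) fit the same skeleton.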
Having formed an initial lexicon and segmentation from the training textual corpus, an iterative process is entered wherein the lexicon, segmentation and language model are jointly optimized, block 506. More particularly, as will be described more fully below, the innovative iterative optimization employs statistical language modeling to dynamically adjust the segmentation and the lexicon, thereby providing an optimized language model. That is, unlike prior language modeling techniques, modeling agent 104 does not rely on a predetermined, static lexicon or on an overly simplistic segmentation algorithm to generate the language model. Rather, modeling agent 104 dynamically generates the lexicon and segmentation from the received textual corpus, or from at least a subset (training set) thereof, to produce an optimized language model. In this regard, the language model generated by modeling agent 104 does not suffer from the deficiencies and limitations commonly associated with prior modeling techniques.
Having introduced the innovative process with respect to Fig. 5, Fig. 6 provides a more detailed flow chart of generating the initial lexicon, and of the iterative process of refining the lexicon and segmentation to optimize the language model, according to one implementation of the present invention. As before, the method begins with step 400 (Fig. 4) of building a prefix tree from the received textual corpus. As noted above, the prefix tree may be built from the entire textual corpus, or from a subset thereof (referred to as the training corpus).
Within block 502, the process of generating the initial lexicon begins with block 602, wherein lexicon generation subroutine 214 generates an initial lexicon from the prefix tree by identifying substrings (or branches of the prefix tree) of less than a predetermined number of items. According to one implementation, lexicon generation subroutine 214 identifies substrings of ten (10) items or fewer to populate the initial lexicon. In block 604, lexicon generation subroutine 214 analyzes the initial lexicon generated in step 602 for obviously illogical substrings and removes them from the initial lexicon. That is, lexicon generation subroutine 214 analyzes the initial lexicon substrings for illogical or improbable words, and removes such words from the lexicon. For this initial pruning, dynamic segmentation subroutine 216 is invoked to segment at least the training set of the received textual corpus, producing a segmented corpus. According to one implementation, a maximum matching algorithm is used to perform the segmentation according to the initial lexicon. Frequency calculation subroutine 213 is then invoked to calculate the frequency of occurrence within the received textual corpus of each word in the lexicon, and the lexicon is sorted by frequency of occurrence. The lowest-frequency words are identified and deleted from the lexicon. The thresholds for deletion and re-segmentation may be determined by the size of the corpus. According to one implementation, for a corpus of 600M items, a frequency threshold of 500 may be used for inclusion in the lexicon. In this way, most obviously illogical words are removed from the initial lexicon.
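The frequency-threshold pruning step can be sketched as below. The function and argument names are hypothetical; the 500 threshold is the value the text cites for a 600M-item corpus, and the small threshold in the usage comment is for illustration only.

```python
from collections import Counter

def prune_lexicon(lexicon, segmented_corpus, threshold=500):
    """Count how often each lexicon word occurs in the segmented
    corpus, then keep only words at or above the frequency threshold.
    segmented_corpus is an iterable of word lists (one per sentence)."""
    counts = Counter(word for sentence in segmented_corpus
                     for word in sentence)
    return {word for word in lexicon if counts[word] >= threshold}

# Usage (threshold lowered to 2 for this tiny corpus): "AB" occurs
# twice and survives; "C" and "D" occur once each and are pruned.
corpus = [["AB", "C"], ["AB", "D"]]
kept = prune_lexicon({"AB", "C", "D"}, corpus, threshold=2)
```

In the full method, each pruned word's occurrences would then be re-segmented into smaller words and the counts recomputed, as described for block 610 below.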
Once the initial lexicon has been generated and pruned in step 502, at least part of the received textual corpus is segmented according to the initial lexicon, block 504. As noted above, according to one implementation, the initial segmentation of the textual corpus is performed using the maximum matching method.
Once the initial lexicon and corpus segmentation are complete, the iterative process of dynamically modifying the lexicon and segmentation to optimize the statistical language model (SLM) from the received textual corpus (or training set) begins, block 506. As shown, the procedure starts with block 606, wherein Markov probability calculator 212 begins language model training on the segmented textual corpus using the initial lexicon and segmentation. That is, given the initial lexicon and initial segmentation, a statistical language model can be generated from them. It should be noted that although this language model does not yet benefit from a refined lexicon or from statistics-based segmentation (which are developed in the steps that follow), it is nonetheless fundamentally grounded in the received textual corpus; thus, even the initial language model is serviceable.
In block 608, after the initial language model training, the segmented textual corpus (or training set) is re-segmented using SLM-based segmentation. Given a sentence w1, w2, ..., wn, there are M possible ways to segment it (M >= 1). Dynamic segmentation subroutine 216 calculates the probability (p_i) of each segmentation (S_i) according to the N-gram statistical language model. According to one implementation, segmentation subroutine 216 utilizes a tri-gram (i.e., N=3) statistical language model to determine the probability of any given segmentation. A Viterbi search algorithm is employed to identify the most probable segmentation S_k, where:
S_k = arg max_i (p_i)    (3)
In block 610, the SLM-based lexicon is updated using the re-segmented textual corpus obtained from the foregoing segmentation. According to one implementation, modeling agent 104 invokes frequency calculation subroutine 213 to calculate the frequency of occurrence within the received textual corpus of each word in the lexicon, and sorts the lexicon by frequency of occurrence. The lowest-frequency word is identified and deleted from the lexicon; each occurrence of that word must then be re-segmented into smaller words, and the counts of all affected words recalculated. The thresholds for deletion and re-segmentation may be determined by the size of the corpus. According to one implementation, for a corpus of 600M items, a frequency threshold of 500 may be used for inclusion in the lexicon.
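The Viterbi-style search for the most probable segmentation, S_k = arg max_i (p_i), can be sketched with dynamic programming over word-end positions. This is a simplified illustration: for brevity it scores words with a unigram log-probability table, whereas the text describes a tri-gram model; the function and table names are hypothetical.

```python
import math

def best_segmentation(text, word_logprob, max_len=10):
    """Dynamic-programming (Viterbi-style) search for the highest
    log-probability segmentation of text. word_logprob maps a word
    to its log probability under the current SLM."""
    n = len(text)
    # best[j] = (best log-prob of text[:j], start index of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = text[i:j]
            if word in word_logprob and best[i][0] > -math.inf:
                score = best[i][0] + word_logprob[word]
                if score > best[j][0]:
                    best[j] = (score, i)
    # Backtrack from the end of the string to recover the words.
    segments, j = [], n
    while j > 0:
        i = best[j][1]
        segments.append(text[i:j])
        j = i
    return list(reversed(segments))

# Usage: P("AB") = 0.5 beats P("A")*P("B") = 0.01, so "AB" stays whole.
probs = {"A": math.log(0.1), "B": math.log(0.1), "AB": math.log(0.5)}
```

A tri-gram variant would condition each word's score on the two preceding words in the candidate segmentation, as claim 7 suggests, but the lattice-and-backtrack structure is the same.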
In block 612, the language model is updated to reflect the dynamically generated lexicon and the SLM-based segmentation, and Markov probability calculator 212 computes a perplexity measure of the language model (i.e., an inverse probability measure). If the perplexity continues to converge (decrease), i.e., improve, the procedure continues at block 608, wherein the lexicon and segmentation are again modified with the intent of further improving language model performance (as measured by perplexity). If, in block 614, it is determined that the most recent modification of the lexicon and segmentation did not improve the language model, it is further determined at block 616 whether the perplexity has reached an acceptable threshold. If so, the procedure ends.
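The perplexity measure used as the stopping criterion can be sketched as the exponential of the negative mean log-probability over the corpus, the standard inverse-probability formulation (the text does not give an explicit formula, so this is an assumed definition; lower is better, and the loop continues while it keeps decreasing).

```python
import math

def perplexity(word_logprobs):
    """Perplexity of a corpus given per-word log probabilities under
    the current language model: exp(-mean log P). A lower value means
    the model predicts the corpus better."""
    return math.exp(-sum(word_logprobs) / len(word_logprobs))

# Usage: if every word has probability 0.25, perplexity is exactly 4,
# i.e. the model is as uncertain as a uniform choice among 4 words.
uniform4 = [math.log(0.25)] * 100
```

In the iteration of blocks 608-616, this value would be recomputed after each lexicon update and re-segmentation, and the loop exits once it stops improving or falls below the acceptable threshold.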
If, however, the language model has not yet reached an acceptable perplexity threshold, then at block 608 lexicon generation subroutine 214 deletes from the lexicon the word with the lowest frequency of occurrence in the corpus, that word is re-segmented into smaller words at block 618, and the procedure proceeds to block 610.
From the foregoing description, it will be appreciated that, premised on a lexicon and segmentation rules dynamically generated, at least statistically, from a subset of the received corpus, the innovative language modeling agent 104 produces an optimized language model. In this regard, the resultant language model has improved computational and predictive attributes over prior language models.
Alternative embodiments
Fig. 7 is a block diagram of a storage medium having stored thereon a plurality of instructions, including instructions to implement the innovative modeling agent of the present invention, according to yet another embodiment of the present invention. In general, Fig. 7 illustrates a storage medium/device 700 having stored thereon a plurality of executable instructions 702, including at least a subset which, when executed, implement the innovative modeling agent 104 of the present invention. When executed by a processor of a host system, executable instructions 702 implement the modeling agent to generate a statistical language model representation of a textual corpus for use by any of a host of other applications executing on, or otherwise available to, the host system.
As used herein, storage medium 700 is intended to represent any of a number of storage devices and/or storage media known to those skilled in the art, e.g., volatile memory devices, non-volatile memory devices, magnetic storage media, optical storage media, and the like. Similarly, the executable instructions are intended to reflect any of a number of software languages known in the art, e.g., C++, Visual Basic, Hypertext Markup Language (HTML), Java, eXtensible Markup Language (XML), and the like. Moreover, it is to be appreciated that storage medium/device 700 need not be co-located with any host system. That is, storage medium/device 700 may well reside within a remote server communicatively coupled to, and accessible by, an executing system. Accordingly, the software implementation of Fig. 7 is to be regarded as illustrative, as alternative storage media and software embodiments are anticipated within the spirit and scope of the present invention.
Although the invention has been described in language specific to structural features and/or methodological steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described. Rather, the specific features and steps are disclosed as exemplary forms of implementing the claimed invention.

Claims (32)

1. A method comprising:
forming an initial language model from a lexicon and a segmentation derived from a received corpus; and
iteratively refining the initial language model by statistically and dynamically updating the lexicon and re-segmenting the corpus, until a predictive capability threshold is reached.
2. The method of claim 1, wherein forming the initial language model comprises:
generating a prefix tree data structure from items parsed from the received corpus;
identifying substrings of N items or fewer from the prefix tree data structure; and
populating said lexicon with the identified substrings.
3. The method of claim 2, wherein N equals 3.
4. The method of claim 1, wherein iteratively refining the initial language model comprises:
re-segmenting said corpus by determining a probability of occurrence for each segmentation.
5. The method of claim 4, wherein an N-gram language model is utilized to calculate the determined probability of occurrence of a segmentation.
6. The method of claim 5, wherein the N-gram language model is a 3-gram language model.
7. The method of claim 4, wherein two preceding segmentations are utilized to calculate the determined probability of occurrence of a segmentation.
8. The method of claim 4, wherein iteratively refining the language model comprises:
updating the lexicon according to the re-segmented corpus.
9. The method of claim 8, wherein updating the lexicon comprises:
determining a frequency of occurrence within the received corpus of each word in the lexicon; and
deleting from the lexicon the word with the lowest determined frequency.
10. The method of claim 9, further comprising:
re-segmenting the deleted word into two or more smaller words, and updating the lexicon with the re-segmented words.
11. The method of claim 8, further comprising:
calculating a predictive measure of the language model utilizing the updated lexicon and the re-segmented corpus.
12. The method of claim 11, wherein the predictive measure is a language model perplexity.
13. The method of claim 11, further comprising:
determining whether the predictive capability of the language model improved as a result of the updating and re-segmenting; and
if the predictive capability improved, performing additional updating and re-segmenting until no further improvement is determined.
14. The method of claim 1, wherein the initial language model is obtained utilizing a maximum matching technique.
15. The method of claim 1, wherein the predictive capability is quantified and expressed as a perplexity measure.
16. The method of claim 15, wherein the language model is refined until the perplexity measure is reduced below an acceptable predictive threshold.
17. The method of claim 1, further comprising:
utilizing the iteratively refined language model in an application to predict a likelihood of another corpus.
18. The method of claim 17, wherein said application is one or more of a spelling and/or grammar checker, a word processing application, a language translation application, a speech recognition application, and the like.
19. A storage medium comprising a plurality of executable instructions, including at least a subset of instructions which, when executed, implement the method of claim 1.
20. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and
an execution unit, coupled with said storage device, to execute at least a subset of said plurality of executable instructions to implement the method of claim 1.
21. A storage medium comprising a plurality of executable instructions, including at least a subset of instructions which, when executed, implement a language modeling agent, the language modeling agent comprising a subroutine to establish an initial language model from a lexicon and a segmentation derived from a received corpus, and a subroutine to iteratively refine the initial language model by statistically and dynamically updating the lexicon and re-segmenting the corpus until a predictive capability threshold is reached.
22. The storage medium of claim 21, wherein the language modeling agent quantifies the determined predictive capability utilizing a perplexity measure.
23. The storage medium of claim 21, wherein the language modeling agent utilizes a maximum matching technique to derive the lexicon and segmentation from the received corpus.
24. The storage medium of claim 21, wherein the subroutine to establish the initial language model generates a prefix tree data structure from items parsed from the received corpus, identifies substrings of N items or fewer from the prefix tree, and populates the lexicon with the identified substrings.
25. The storage medium of claim 21, wherein the subroutine iteratively refines the initial language model by determining a frequency of occurrence of each segmentation and re-segmenting the corpus to reflect improved segmentation probabilities.
26. The storage medium of claim 25, wherein the language modeling agent utilizes a hidden Markov probability measure to determine the probability of occurrence of each segmentation.
27. The storage medium of claim 19, further comprising instructions which, when executed, implement an application utilizing the language model established by the language modeling agent.
28. A system comprising:
a storage media drive to removably receive the storage medium of claim 19; and
an execution unit, coupled with said storage media drive, to access and execute at least a subset of the plurality of executable instructions resident on the removably received storage medium to implement a language modeling agent.
29. A modeling agent comprising:
a statistical calculator to determine a likelihood of a corpus segmentation; and
a data structure generator to establish an initial language model from a lexicon and a segmentation dynamically derived from a received corpus, and to iteratively refine the language model until the likelihood of the corpus segmentation reaches an acceptable threshold.
30. The modeling agent of claim 29, wherein the statistical calculator utilizes Markov modeling techniques to determine the likelihood of the corpus segmentation.
31. The modeling agent of claim 29, wherein the data structure generator generates a prefix tree data structure from items parsed from the received corpus, identifies substrings of N items or fewer from the prefix tree, and populates the lexicon with the identified substrings.
32. The modeling agent of claim 31, wherein the statistical calculator determines the likelihood of the identified substrings, and wherein the modeling agent re-segments the corpus in an attempt to improve the substring likelihood.
CNB008152942A 1999-11-05 2000-11-03 System and iterative method for lexicon, segmentation and language model joint optimization Expired - Fee Related CN100430929C (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16385099P 1999-11-05 1999-11-05
US60/163,850 1999-11-05
US09/609,202 2000-06-30
US09/609,202 US6904402B1 (en) 1999-11-05 2000-06-30 System and iterative method for lexicon, segmentation and language model joint optimization

Publications (2)

Publication Number Publication Date
CN1387651A true CN1387651A (en) 2002-12-25
CN100430929C CN100430929C (en) 2008-11-05

Family

ID=26860000

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB008152942A Expired - Fee Related CN100430929C (en) 1999-11-05 2000-11-03 System and iterative method for lexicon, segmentation and language model joint optimization

Country Status (5)

Country Link
US (2) US6904402B1 (en)
JP (1) JP2003523559A (en)
CN (1) CN100430929C (en)
AU (1) AU4610401A (en)
WO (1) WO2001037128A2 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100380370C (en) * 2003-04-30 2008-04-09 罗伯特·博世有限公司 Method for statistical language modeling in speech recognition
CN101266599B (en) * 2005-01-31 2010-07-21 日电(中国)有限公司 Input method and user terminal
CN1916889B (en) * 2005-08-19 2011-02-02 株式会社日立制作所 Language material storage preparation device and its method
CN101097488B (en) * 2006-06-30 2011-05-04 2012244安大略公司 Method for learning character fragments from received text and relevant hand-hold electronic equipments
CN103034628A (en) * 2011-10-27 2013-04-10 微软公司 Functionality for normalizing linguistic items
CN103201707A (en) * 2010-09-29 2013-07-10 触摸式有限公司 System and method for inputting text into electronic devices
CN105159890A (en) * 2014-06-06 2015-12-16 谷歌公司 Generating representations of input sequences using neural networks
CN105786796A (en) * 2008-04-16 2016-07-20 谷歌公司 Segmenting words using scaled probabilities
CN107111609A (en) * 2014-12-12 2017-08-29 全方位人工智能股份有限公司 Lexical analyzer for neural language performance identifying system
CN107427732A (en) * 2016-12-09 2017-12-01 香港应用科技研究院有限公司 For the system and method for the data structure for organizing and handling feature based
US10613746B2 (en) 2012-01-16 2020-04-07 Touchtype Ltd. System and method for inputting text
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium

Families Citing this family (101)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7750891B2 (en) 2003-04-09 2010-07-06 Tegic Communications, Inc. Selective input system based on tracking of motion parameters of an input device
US7821503B2 (en) * 2003-04-09 2010-10-26 Tegic Communications, Inc. Touch screen and graphical user interface
WO2000074240A1 (en) * 1999-05-27 2000-12-07 America Online Keyboard system with automatic correction
US7030863B2 (en) 2000-05-26 2006-04-18 America Online, Incorporated Virtual keyboard system with automatic correction
US7286115B2 (en) 2000-05-26 2007-10-23 Tegic Communications, Inc. Directional input system with automatic correction
US20050044148A1 (en) * 2000-06-29 2005-02-24 Microsoft Corporation Method and system for accessing multiple types of electronic content
US7020587B1 (en) * 2000-06-30 2006-03-28 Microsoft Corporation Method and apparatus for generating and managing a language model data structure
CN1226717C (en) * 2000-08-30 2005-11-09 国际商业机器公司 Automatic new term fetch method and system
DE60029456T2 (en) * 2000-12-11 2007-07-12 Sony Deutschland Gmbh Method for online adjustment of pronunciation dictionaries
WO2002097663A1 (en) * 2001-05-31 2002-12-05 University Of Southern California Integer programming decoder for machine translation
US8214196B2 (en) 2001-07-03 2012-07-03 University Of Southern California Syntax-based statistical translation model
US7493258B2 (en) * 2001-07-03 2009-02-17 Intel Corporation Method and apparatus for dynamic beam control in Viterbi search
JP2003036088A (en) * 2001-07-23 2003-02-07 Canon Inc Dictionary managing apparatus for voice conversion
AU2003269808A1 (en) * 2002-03-26 2004-01-06 University Of Southern California Constructing a translation lexicon from comparable, non-parallel corpora
US7269548B2 (en) * 2002-07-03 2007-09-11 Research In Motion Ltd System and method of creating and using compact linguistic data
CA2523992C (en) * 2003-05-28 2012-07-17 Loquendo S.P.A. Automatic segmentation of texts comprising chunks without separators
US7711545B2 (en) * 2003-07-02 2010-05-04 Language Weaver, Inc. Empirical methods for splitting compound words with application to machine translation
US8548794B2 (en) 2003-07-02 2013-10-01 University Of Southern California Statistical noun phrase translation
US7941310B2 (en) * 2003-09-09 2011-05-10 International Business Machines Corporation System and method for determining affixes of words
US7698125B2 (en) * 2004-03-15 2010-04-13 Language Weaver, Inc. Training tree transducers for probabilistic operations
US8296127B2 (en) * 2004-03-23 2012-10-23 University Of Southern California Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
US8666725B2 (en) 2004-04-16 2014-03-04 University Of Southern California Selection and use of nonstatistical translation components in a statistical machine translation framework
DE112005002534T5 (en) 2004-10-12 2007-11-08 University Of Southern California, Los Angeles Training for a text-to-text application that uses a string-tree transformation for training and decoding
PT1666074E (en) 2004-11-26 2008-08-22 Ba Ro Gmbh & Co Kg Disinfection lamp
CN100530171C (en) * 2005-01-31 2009-08-19 日电(中国)有限公司 Dictionary learning method and devcie
CN101124579A (en) * 2005-02-24 2008-02-13 富士施乐株式会社 Word translation device, translation method, and translation program
US7996219B2 (en) * 2005-03-21 2011-08-09 At&T Intellectual Property Ii, L.P. Apparatus and method for model adaptation for spoken language understanding
US8676563B2 (en) 2009-10-01 2014-03-18 Language Weaver, Inc. Providing human-generated and machine-generated trusted translations
US8886517B2 (en) 2005-06-17 2014-11-11 Language Weaver, Inc. Trust scoring for language translation systems
US7974833B2 (en) 2005-06-21 2011-07-05 Language Weaver, Inc. Weighted system of expressing language information using a compact notation
US7389222B1 (en) 2005-08-02 2008-06-17 Language Weaver, Inc. Task parallelization in a text-to-text system
US7813918B2 (en) * 2005-08-03 2010-10-12 Language Weaver, Inc. Identifying documents which form translated pairs, within a document collection
US7624020B2 (en) * 2005-09-09 2009-11-24 Language Weaver, Inc. Adapter for allowing both online and offline training of a text to text system
US20070078644A1 (en) * 2005-09-30 2007-04-05 Microsoft Corporation Detecting segmentation errors in an annotated corpus
US7328199B2 (en) * 2005-10-07 2008-02-05 Microsoft Corporation Componentized slot-filling architecture
US7941418B2 (en) * 2005-11-09 2011-05-10 Microsoft Corporation Dynamic corpus generation
US7606700B2 (en) * 2005-11-09 2009-10-20 Microsoft Corporation Adaptive task framework
US7822699B2 (en) * 2005-11-30 2010-10-26 Microsoft Corporation Adaptive semantic reasoning engine
US20070106496A1 (en) * 2005-11-09 2007-05-10 Microsoft Corporation Adaptive task framework
US10319252B2 (en) 2005-11-09 2019-06-11 Sdl Inc. Language capability assessment and training apparatus and techniques
US20070130134A1 (en) * 2005-12-05 2007-06-07 Microsoft Corporation Natural-language enabling arbitrary web forms
US7933914B2 (en) 2005-12-05 2011-04-26 Microsoft Corporation Automatic task creation and execution using browser helper objects
US7831585B2 (en) * 2005-12-05 2010-11-09 Microsoft Corporation Employment of task framework for advertising
US7835911B2 (en) * 2005-12-30 2010-11-16 Nuance Communications, Inc. Method and system for automatically building natural language understanding models
US20090006092A1 (en) * 2006-01-23 2009-01-01 Nec Corporation Speech Recognition Language Model Making System, Method, and Program, and Speech Recognition System
EP2511833B1 (en) * 2006-02-17 2020-02-05 Google LLC Encoding and adaptive, scalable accessing of distributed translation models
US20070203869A1 (en) * 2006-02-28 2007-08-30 Microsoft Corporation Adaptive semantic platform architecture
US7996783B2 (en) * 2006-03-02 2011-08-09 Microsoft Corporation Widget searching utilizing task framework
US8943080B2 (en) 2006-04-07 2015-01-27 University Of Southern California Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
US20070271087A1 (en) * 2006-05-18 2007-11-22 Microsoft Corporation Language-independent language model using character classes
US7558725B2 (en) * 2006-05-23 2009-07-07 Lexisnexis, A Division Of Reed Elsevier Inc. Method and apparatus for multilingual spelling corrections
WO2007142102A1 (en) * 2006-05-31 2007-12-13 Nec Corporation Language model learning system, language model learning method, and language model learning program
US8886518B1 (en) 2006-08-07 2014-11-11 Language Weaver, Inc. System and method for capitalizing machine translated text
US8433556B2 (en) 2006-11-02 2013-04-30 University Of Southern California Semi-supervised training for statistical word alignment
US9122674B1 (en) 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
US8468149B1 (en) 2007-01-26 2013-06-18 Language Weaver, Inc. Multi-lingual online community
US8225203B2 (en) 2007-02-01 2012-07-17 Nuance Communications, Inc. Spell-check for a keyboard system with automatic correction
US8201087B2 (en) 2007-02-01 2012-06-12 Tegic Communications, Inc. Spell-check for a keyboard system with automatic correction
US9465791B2 (en) * 2007-02-09 2016-10-11 International Business Machines Corporation Method and apparatus for automatic detection of spelling errors in one or more documents
US8615389B1 (en) 2007-03-16 2013-12-24 Language Weaver, Inc. Generation and exploitation of an approximate language model
US8831928B2 (en) 2007-04-04 2014-09-09 Language Weaver, Inc. Customizable machine translation service
US8825466B1 (en) 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
US8010341B2 (en) * 2007-09-13 2011-08-30 Microsoft Corporation Adding prototype information into probabilistic models
US8521516B2 (en) * 2008-03-26 2013-08-27 Google Inc. Linguistic key normalization
US8353008B2 (en) * 2008-05-19 2013-01-08 Yahoo! Inc. Authentication detection
US9411800B2 (en) * 2008-06-27 2016-08-09 Microsoft Technology Licensing, Llc Adaptive generation of out-of-dictionary personalized long words
US8301437B2 (en) * 2008-07-24 2012-10-30 Yahoo! Inc. Tokenization platform
US8462123B1 (en) * 2008-10-21 2013-06-11 Google Inc. Constrained keyboard organization
CN101430680B (en) 2008-12-31 2011-01-19 阿里巴巴集团控股有限公司 Segmentation sequence selection method and system for non-word boundary marking language text
US8326599B2 (en) * 2009-04-21 2012-12-04 Xerox Corporation Bi-phrase filtering for statistical machine translation
US8990064B2 (en) 2009-07-28 2015-03-24 Language Weaver, Inc. Translating documents based on content
US8380486B2 (en) 2009-10-01 2013-02-19 Language Weaver, Inc. Providing machine-generated translations and corresponding trust levels
US10417646B2 (en) 2010-03-09 2019-09-17 Sdl Inc. Predicting the cost associated with translating textual content
US11003838B2 (en) 2011-04-18 2021-05-11 Sdl Inc. Systems and methods for monitoring post translation editing
US8972260B2 (en) * 2011-04-20 2015-03-03 Robert Bosch Gmbh Speech recognition using multiple language models
US8694303B2 (en) 2011-06-15 2014-04-08 Language Weaver, Inc. Systems and methods for tuning parameters in statistical machine translation
US8886515B2 (en) 2011-10-19 2014-11-11 Language Weaver, Inc. Systems and methods for enhancing machine translation post edit review processes
US8942973B2 (en) 2012-03-09 2015-01-27 Language Weaver, Inc. Content page URL translation
US10261994B2 (en) 2012-05-25 2019-04-16 Sdl Inc. Method and system for automatic management of reputation of translators
CN102799676B (en) * 2012-07-18 2015-02-18 上海语天信息技术有限公司 Recursive and multilevel Chinese word segmentation method
US9152622B2 (en) 2012-11-26 2015-10-06 Language Weaver, Inc. Personalized machine translation via online adaptation
CN103871404B (en) * 2012-12-13 2017-04-12 北京百度网讯科技有限公司 Language model training method, query method and corresponding device
IL224482B (en) * 2013-01-29 2018-08-30 Verint Systems Ltd System and method for keyword spotting using representative dictionary
US9396723B2 (en) * 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN104217717B (en) * 2013-05-29 2016-11-23 Tencent Technology (Shenzhen) Company Limited Method and device for building a language model
US9396724B2 (en) 2013-05-29 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
US9213694B2 (en) 2013-10-10 2015-12-15 Language Weaver, Inc. Efficient online domain adaptation
US9972311B2 (en) * 2014-05-07 2018-05-15 Microsoft Technology Licensing, Llc Language model optimization for in-domain application
US9953646B2 (en) 2014-09-02 2018-04-24 Belleau Technologies Method and system for dynamic speech recognition and tracking of prewritten script
US10409910B2 (en) * 2014-12-12 2019-09-10 Omni Ai, Inc. Perceptual associative memory for a neuro-linguistic behavior recognition system
US9734826B2 (en) 2015-03-11 2017-08-15 Microsoft Technology Licensing, Llc Token-level interpolation for class-based language models
KR101668725B1 (en) * 2015-03-18 2016-10-24 Sungkyunkwan University Research & Business Foundation Latent keyphrase generation method and apparatus
IL242218B (en) 2015-10-22 2020-11-30 Verint Systems Ltd System and method for maintaining a dynamic dictionary
IL242219B (en) 2015-10-22 2020-11-30 Verint Systems Ltd System and method for keyword searching using both static and dynamic dictionaries
CN109408794A (en) * 2017-08-17 2019-03-01 Alibaba Group Holding Limited Method for building a frequency dictionary, word segmentation method, server, and client device
US10607604B2 (en) * 2017-10-27 2020-03-31 International Business Machines Corporation Method for re-aligning corpus and improving the consistency
CN110162681B (en) * 2018-10-08 2023-04-18 腾讯科技(深圳)有限公司 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
US11893983B2 (en) * 2021-06-23 2024-02-06 International Business Machines Corporation Adding words to a prefix tree for improving speech recognition
CN113468308B (en) * 2021-06-30 2023-02-10 竹间智能科技(上海)有限公司 Conversation behavior classification method and device and electronic equipment

Family Cites Families (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4689768A (en) * 1982-06-30 1987-08-25 International Business Machines Corporation Spelling verification system with immediate operator alerts to non-matches between inputted words and words stored in plural dictionary memories
US4899148A (en) * 1987-02-25 1990-02-06 Oki Electric Industry Co., Ltd. Data compression method
US6231938B1 (en) * 1993-07-02 2001-05-15 Watkins Manufacturing Corporation Extruded multilayer polymeric shell having textured and marbled surface
US5621859A (en) * 1994-01-19 1997-04-15 Bbn Corporation Single tree method for grammar directed, very large vocabulary speech recognizer
US5926388A (en) * 1994-12-09 1999-07-20 Kimbrough; Thomas C. System and method for producing a three dimensional relief
US5806021A (en) * 1995-10-30 1998-09-08 International Business Machines Corporation Automatic segmentation of continuous text using statistical approaches
JP3277792B2 (en) * 1996-01-31 2002-04-22 株式会社日立製作所 Data compression method and apparatus
FR2744817B1 (en) * 1996-02-08 1998-04-03 Ela Medical Sa ACTIVE IMPLANTABLE MEDICAL DEVICE AND ITS EXTERNAL PROGRAMMER WITH AUTOMATIC SOFTWARE UPDATE
US5822729A (en) * 1996-06-05 1998-10-13 Massachusetts Institute Of Technology Feature-based speech recognizer having probabilistic linguistic processor providing word matching based on the entire space of feature vectors
US5963893A (en) * 1996-06-28 1999-10-05 Microsoft Corporation Identification of words in Japanese text by a computer system
SE516189C2 (en) * 1996-07-03 2001-11-26 Ericsson Telefon Ab L M Method and apparatus for activating a user menu in a presentation means
US5905972A (en) * 1996-09-30 1999-05-18 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
US6424722B1 (en) * 1997-01-13 2002-07-23 Micro Ear Technology, Inc. Portable system for programming hearing aids
US6449662B1 (en) * 1997-01-13 2002-09-10 Micro Ear Technology, Inc. System for programming hearing aids
DE19708183A1 (en) * 1997-02-28 1998-09-03 Philips Patentverwaltung Method for speech recognition with language model adaptation
US6684063B2 (en) * 1997-05-02 2004-01-27 Siemens Information & Communication Networks, Inc. Integrated hearing aid for telecommunications devices
EP0932897B1 (en) 1997-06-26 2003-10-08 Koninklijke Philips Electronics N.V. A machine-organized method and a device for translating a word-organized source text into a word-organized target text
JPH1169495A (en) * 1997-07-18 1999-03-09 Koninkl Philips Electron Nv Hearing aid
JPH1169499A (en) * 1997-07-18 1999-03-09 Koninkl Philips Electron Nv Hearing aid, remote control device and system
JP3190859B2 (en) * 1997-07-29 2001-07-23 松下電器産業株式会社 CDMA radio transmitting apparatus and CDMA radio receiving apparatus
WO1999007302A1 (en) * 1997-08-07 1999-02-18 Natan Bauman Apparatus and method for an auditory stimulator
FI105874B (en) * 1997-08-12 2000-10-13 Nokia Mobile Phones Ltd Multiple mobile broadcasting
US6052657A (en) * 1997-09-09 2000-04-18 Dragon Systems, Inc. Text segmentation and identification of topic using language models
US6081629A (en) * 1997-09-17 2000-06-27 Browning; Denton R. Handheld scanner and accompanying remote access agent
US6076056A (en) * 1997-09-19 2000-06-13 Microsoft Corporation Speech recognition system for recognizing continuous and isolated speech
US6163769A (en) * 1997-10-02 2000-12-19 Microsoft Corporation Text-to-speech using clustered context-dependent phoneme-based units
US6674867B2 (en) * 1997-10-15 2004-01-06 Belltone Electronics Corporation Neurofuzzy based device for programmable hearing aids
US6219427B1 (en) * 1997-11-18 2001-04-17 Gn Resound As Feedback cancellation improvements
US6695943B2 (en) * 1997-12-18 2004-02-24 Softear Technologies, L.L.C. Method of manufacturing a soft hearing aid
US6366863B1 (en) * 1998-01-09 2002-04-02 Micro Ear Technology Inc. Portable hearing-related analysis system
US6023570A (en) * 1998-02-13 2000-02-08 Lattice Semiconductor Corp. Sequential and simultaneous manufacturing programming of multiple in-system programmable systems through a data network
US6545989B1 (en) * 1998-02-19 2003-04-08 Qualcomm Incorporated Transmit gating in a wireless communication system
US6104913A (en) * 1998-03-11 2000-08-15 Bell Atlantic Network Services, Inc. Personal area network for personal telephone services
US6418431B1 (en) * 1998-03-30 2002-07-09 Microsoft Corporation Information retrieval and speech recognition based on language models
US6141641A (en) * 1998-04-15 2000-10-31 Microsoft Corporation Dynamically configurable acoustic model for speech recognition system
US6347148B1 (en) * 1998-04-16 2002-02-12 Dspfactory Ltd. Method and apparatus for feedback reduction in acoustic systems, particularly in hearing aids
US6351472B1 (en) * 1998-04-30 2002-02-26 Siemens Audiologische Technik Gmbh Serial bidirectional data transmission method for hearing devices by means of signals of different pulsewidths
US6137889A (en) * 1998-05-27 2000-10-24 Insonus Medical, Inc. Direct tympanic membrane excitation via vibrationally conductive assembly
US6188979B1 (en) * 1998-05-28 2001-02-13 Motorola, Inc. Method and apparatus for estimating the fundamental frequency of a signal
US6151645A (en) * 1998-08-07 2000-11-21 Gateway 2000, Inc. Computer communicates with two incompatible wireless peripherals using fewer transceivers
US6240193B1 (en) * 1998-09-17 2001-05-29 Sonic Innovations, Inc. Two line variable word length serial interface
US6061431A (en) * 1998-10-09 2000-05-09 Cisco Technology, Inc. Method for hearing loss compensation in telephony systems based on telephone number resolution
US6188976B1 (en) * 1998-10-23 2001-02-13 International Business Machines Corporation Apparatus and method for building domain-specific language models
US6838485B1 (en) * 1998-10-23 2005-01-04 Baker Hughes Incorporated Treatments for drill cuttings
US6265102B1 (en) * 1998-11-05 2001-07-24 Electric Fuel Limited (E.F.L.) Prismatic metal-air cells
JP4302326B2 (en) * 1998-11-30 2009-07-22 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Automatic classification of text
DE19858398C1 (en) * 1998-12-17 2000-03-02 Implex Hear Tech Ag Tinnitus treatment implant comprises a gas-tight biocompatible electroacoustic transducer for implantation in a mastoid cavity
US6208273B1 (en) * 1999-01-29 2001-03-27 Interactive Silicon, Inc. System and method for performing scalable embedded parallel data compression
DE19914993C1 (en) * 1999-04-01 2000-07-20 Implex Hear Tech Ag Fully implantable hearing system with telemetric sensor testing has measurement and wireless telemetry units on implant side for transmitting processed signal to external display/evaluation unit
DE19915846C1 (en) * 1999-04-08 2000-08-31 Implex Hear Tech Ag Partially implantable system for rehabilitating hearing trouble includes a cordless telemetry device to transfer data between an implantable part, an external unit and an energy supply.
US6094492A (en) * 1999-05-10 2000-07-25 Boesen; Peter V. Bone conduction voice transmission apparatus and system
US20020032564A1 (en) * 2000-04-19 2002-03-14 Farzad Ehsani Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface
US6557029B2 (en) * 1999-06-28 2003-04-29 Micro Design Services, Llc System and method for distributing messages
US6490558B1 (en) * 1999-07-28 2002-12-03 Custom Speech Usa, Inc. System and method for improving the accuracy of a speech recognition program through repetitive training
US6590986B1 (en) * 1999-11-12 2003-07-08 Siemens Hearing Instruments, Inc. Patient-isolating programming interface for programming hearing aids
US6324907B1 (en) * 1999-11-29 2001-12-04 Microtronic A/S Flexible substrate transducer assembly
US6366880B1 (en) * 1999-11-30 2002-04-02 Motorola, Inc. Method and apparatus for suppressing acoustic background noise in a communication system by equalization of pre- and post-comb-filtered subband spectral energies
US6601093B1 (en) * 1999-12-01 2003-07-29 Ibm Corporation Address resolution in ad-hoc networking
JP2001169380A (en) * 1999-12-14 2001-06-22 Casio Comput Co Ltd Ear mount type music reproducing device, and music reproduction system
US6377925B1 (en) * 1999-12-16 2002-04-23 Interactive Solutions, Inc. Electronic translator for assisting communications
JP2001177596A (en) * 1999-12-20 2001-06-29 Toshiba Corp Communication equipment and communication method
JP2001177889A (en) * 1999-12-21 2001-06-29 Casio Comput Co Ltd Body mounted music reproducing device, and music reproduction system
US6584358B2 (en) * 2000-01-07 2003-06-24 Biowave Corporation Electro therapy method and apparatus
US6850775B1 (en) * 2000-02-18 2005-02-01 Phonak Ag Fitting-anlage
US20010033664A1 (en) * 2000-03-13 2001-10-25 Songbird Hearing, Inc. Hearing aid format selector
DE10018360C2 (en) * 2000-04-13 2002-10-10 Cochlear Ltd At least partially implantable system for the rehabilitation of a hearing impairment
DE10018361C2 (en) * 2000-04-13 2002-10-10 Cochlear Ltd At least partially implantable cochlear implant system for the rehabilitation of a hearing disorder
DE10018334C1 (en) * 2000-04-13 2002-02-28 Implex Hear Tech Ag At least partially implantable system for the rehabilitation of a hearing impairment
US20010049566A1 (en) * 2000-05-12 2001-12-06 Samsung Electronics Co., Ltd. Apparatus and method for controlling audio output in a mobile terminal
AU6814201A (en) * 2000-06-01 2001-12-11 Otologics Llc Method and apparatus for measuring the performance of an implantable middle ear hearing aid, and the response of patient wearing such a hearing aid
DE10031832C2 (en) * 2000-06-30 2003-04-30 Cochlear Ltd Hearing aid for the rehabilitation of a hearing disorder
DE10041726C1 (en) * 2000-08-25 2002-05-23 Implex Ag Hearing Technology I Implantable hearing system with means for measuring the coupling quality
US20020076073A1 (en) * 2000-12-19 2002-06-20 Taenzer Jon C. Automatically switched hearing aid communications earpiece
US6584356B2 (en) * 2001-01-05 2003-06-24 Medtronic, Inc. Downloadable software support in a pacemaker
US20020095892A1 (en) * 2001-01-09 2002-07-25 Johnson Charles O. Cantilevered structural support
US6582628B2 (en) * 2001-01-17 2003-06-24 Dupont Mitsui Fluorochemicals Conductive melt-processible fluoropolymer
US6590987B2 (en) * 2001-01-17 2003-07-08 Etymotic Research, Inc. Two-wired hearing aid system utilizing two-way communication for programming
US6823312B2 (en) * 2001-01-18 2004-11-23 International Business Machines Corporation Personalized system for providing improved understandability of received speech
US20020150219A1 (en) * 2001-04-12 2002-10-17 Jorgenson Joel A. Distributed audio system for the capture, conditioning and delivery of sound
US6913578B2 (en) * 2001-05-03 2005-07-05 Apherma Corporation Method for customizing audio systems for hearing impaired
US6944474B2 (en) * 2001-09-20 2005-09-13 Sound Id Sound enhancement for mobile phones and other products producing personalized audio for users
US20030128859A1 (en) * 2002-01-08 2003-07-10 International Business Machines Corporation System and method for audio enhancement of digital devices for hearing impaired
CN1243541C (en) * 2002-05-09 2006-03-01 中国医学科学院药物研究所 2-(alpha-hydroxypentyl) benzoate and its preparing process and usage

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100380370C (en) * 2003-04-30 2008-04-09 罗伯特·博世有限公司 Method for statistical language modeling in speech recognition
CN101266599B (en) * 2005-01-31 2010-07-21 日电(中国)有限公司 Input method and user terminal
CN1916889B (en) * 2005-08-19 2011-02-02 Hitachi, Ltd. Corpus preparation device and method
CN101097488B (en) * 2006-06-30 2011-05-04 2012244 Ontario Inc. Method for learning character fragments from received text and related handheld electronic devices
CN105786796A (en) * 2008-04-16 2016-07-20 谷歌公司 Segmenting words using scaled probabilities
CN105786796B (en) * 2008-04-16 2019-02-22 Google LLC Segmenting words using scaled probabilities
CN103201707A (en) * 2010-09-29 2013-07-10 触摸式有限公司 System and method for inputting text into electronic devices
CN103201707B (en) * 2010-09-29 2017-09-29 Touchtype Ltd. Text prediction engine, system and method for inputting text into electronic devices
CN103034628A (en) * 2011-10-27 2013-04-10 微软公司 Functionality for normalizing linguistic items
CN103034628B (en) * 2011-10-27 2015-12-02 Microsoft Technology Licensing, LLC Functionality for normalizing linguistic items
US10613746B2 (en) 2012-01-16 2020-04-07 Touchtype Ltd. System and method for inputting text
CN105159890A (en) * 2014-06-06 2015-12-16 谷歌公司 Generating representations of input sequences using neural networks
US10181098B2 (en) 2014-06-06 2019-01-15 Google Llc Generating representations of input sequences using neural networks
US11222252B2 (en) 2014-06-06 2022-01-11 Google Llc Generating representations of input sequences using neural networks
CN107111609A (en) * 2014-12-12 2017-08-29 Omni AI, Inc. Lexical analyzer for a neural language behavior recognition system
CN107111609B (en) * 2014-12-12 2021-02-26 Omni AI, Inc. Lexical analyzer for a neural language behavior recognition system
CN107427732A (en) * 2016-12-09 2017-12-01 Hong Kong Applied Science and Technology Research Institute Co., Ltd. System and method for organizing and processing feature-based data structures
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2001037128A2 (en) 2001-05-25
WO2001037128A3 (en) 2002-02-07
JP2003523559A (en) 2003-08-05
US6904402B1 (en) 2005-06-07
US20040210434A1 (en) 2004-10-21
AU4610401A (en) 2001-05-30
CN100430929C (en) 2008-11-05

Similar Documents

Publication Publication Date Title
CN100430929C (en) System and iterative method for lexicon, segmentation and language model joint optimization
US10650356B2 (en) Intelligent self-service delivery advisor
US11468233B2 (en) Intention identification method, intention identification apparatus, and computer-readable recording medium
CN114585999A (en) Multilingual code line completion system
JP4945086B2 (en) Statistical language model for logical forms
JP5484317B2 (en) Large-scale language model in machine translation
US7020587B1 (en) Method and apparatus for generating and managing a language model data structure
RU2336552C2 (en) Linguistically informed statistic models of structure of components for ordering in realisation of sentences for system of natural language generation
CN1161747C (en) Network interactive user interface using speech recognition and natural language processing
US20210035556A1 (en) Fine-tuning language models for supervised learning tasks via dataset preprocessing
US20070282594A1 (en) Machine translation in natural language application development
TW201717070A (en) Statistics-based machine translation method, apparatus and electronic device
CN1426561A (en) Computer-aided reading system and method with cross-languige reading wizard
KR101130457B1 (en) Extracting treelet translation pairs
WO2001037126A2 (en) A system and method for joint optimization of language model performance and size
CN1457041A (en) System for automatically suppying training data for natural language analyzing system
CN101065746A (en) System and method for automatic enrichment of documents
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
US20220108080A1 (en) Reinforcement Learning Techniques for Dialogue Management
CN1627300A (en) Learning and using generalized string patterns for information extraction
CN1750119A (en) Creating a speech recognition grammar for alphanumeric concepts
CN100351837C (en) Automatic resolution of segmentation ambiguities in grammar authoring
CN113779062A (en) SQL statement generation method and device, storage medium and electronic equipment
CN111328416B (en) Speech patterns for fuzzy matching in natural language processing
CN110890090A (en) Context-based auxiliary interaction control method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150506

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150506

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington, USA

Patentee before: Microsoft Corp.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081105

Termination date: 20181103