CN1387651A - System and iterative method for lexicon, segmentation and language model joint optimization - Google Patents



Publication number
CN1387651A
CN1387651A (application CN00815294A)
Authority
CN
China
Prior art keywords
segmentation
language model
dictionary
corpus
accordance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN00815294A
Other languages
Chinese (zh)
Other versions
CN100430929C (en)
Inventor
王海峰
黄常宁
李凯夫
狄硕
蔡东峰
秦立峰
郭建峰
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date
Filing date
Publication date
Application filed by Microsoft Corp
Publication of CN1387651A
Application granted
Publication of CN100430929C
Anticipated expiration
Legal status: Expired - Fee Related

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/253 — Grammatical analysis; Style critique
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/08 — Speech classification or search
    • G10L 15/18 — Speech classification or search using natural language modelling
    • G10L 15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 — Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 — Probabilistic grammars, e.g. word n-grams


Abstract

A method for optimizing a language model is presented comprising developing an initial language model from a lexicon and segmentation derived from a received corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved.

Description

System and iterative method for lexicon, segmentation and language model joint optimization
This application claims priority to provisional patent application No. 60/163,850, "An iterative method for lexicon, word segmentation and language model joint optimization," filed November 5, 1999 by the present inventors.
Technical field
The present invention relates to language modeling and, more particularly, to a system and iterative method for joint optimization of a lexicon, word segmentation and a language model.
Background
Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications, including web browsers, word processing and speech recognition applications. The latest generation of web browsers, for example, anticipates a uniform resource locator (URL) address after the user has entered only the first two or three characters of a domain name. Word processors offer improved spelling and grammar checking, word prediction, and language conversion. Newer speech recognition applications likewise offer impressive recognition and prediction accuracy. To be useful to the end user, these features must be performed substantially in real time. To provide this level of performance, many applications rely on a tree data structure to build a simple language model.
Briefly, a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of that sequence. A common approach to building a language model is to construct an N-gram language model with a prefix tree data structure from a training set drawn from a known textual corpus.
The use of a prefix tree data structure (also known as a suffix tree, or PAT tree) enables a higher-level application to traverse the language model quickly, providing the substantially real-time performance characteristics described above. Briefly, an N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in strings of size N throughout a text. The counts are then used to calculate the probability of use of a string. A typical tri-gram (N-gram with N=3) approach comprises the steps of:
(a) dividing the text corpus into items (characters, letters, numbers, etc.);
(b) segmenting the items (e.g., characters (C)) into strings (e.g., words (W)) according to a small predetermined lexicon and a simple predetermined segmentation algorithm, wherein each W maps to one or more C in the tree data structure; and
(c) training the language model from the segmented corpus by counting occurrences of strings, thereby predicting the probability of a series of words (W_1, W_2, ..., W_M) from the preceding two words:
P(W_1, W_2, W_3, ..., W_M) ≈ ∏_i P(W_i | W_{i-1}, W_{i-2})    (1)
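For reference, the conventional tri-gram approach of steps (a)-(c) can be sketched as follows. This is a minimal illustration of the prior-art baseline, not the patent's method; the function names and the unsmoothed maximum-likelihood estimate are assumptions.

```python
from collections import defaultdict

def train_trigram(words):
    """Step (c): count trigrams and their bigram prefixes over a
    pre-segmented corpus (a list of words)."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    padded = ["<s>", "<s>"] + list(words)  # pad so the first word has a context
    for i in range(2, len(padded)):
        bi[(padded[i-2], padded[i-1])] += 1
        tri[(padded[i-2], padded[i-1], padded[i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w2, w1, w):
    """P(w | w2 w1) by maximum likelihood; no smoothing in this sketch,
    so unseen contexts simply return 0."""
    denom = bi[(w2, w1)]
    return tri[(w2, w1, w)] / denom if denom else 0.0
```

Note that any error in the predetermined segmentation feeding `train_trigram` is baked into the counts, which is exactly the propagation problem the background section describes.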
N-gram language models are limited in several respects. First, the counting process used to construct the prefix tree is very time-consuming, so that in practice only small N-gram models (typically 2-gram or 3-gram) can be realized. Second, as the string length (N) of the N-gram language model increases, the memory required to store the prefix tree grows as 2^N. Thus, for N-grams larger than three (i.e., beyond 3-gram), the memory required to store the N-gram language model, and the access time required to use it, become prohibitively large.
Prior-art N-gram language models tend to use a fixed (and small) lexicon and an overly simplistic segmentation algorithm, and typically rely on only the two preceding words to predict the current word (in the case of a 3-gram model).
A fixed lexicon limits the model's ability to select the words best suited to a general or specialized task. If a word is not in the lexicon then, as far as the associated model is concerned, the word does not exist. A small lexicon therefore cannot cover the expected language content.
Segmentation algorithms are typically ad hoc, and are not based on any statistical or semantic principles. A common failing of overly simple segmentation algorithms is to discard smaller words in favor of larger ones. As a result, such models cannot accurately predict smaller words contained within semantically acceptable larger strings.
As a result of the foregoing limitations, language models built with prior-art lexicons and segmentation algorithms tend to be error-prone. That is, any error introduced at the lexicon or segmentation stage propagates throughout the entire language model, limiting the accuracy and predictive properties of the model.
Finally, confining the model's context to at most the two preceding words (in the case of a 3-gram language model) is likewise restrictive, since accurately predicting the likelihood of a word may require more context. These three limitations together typically result in poor predictive quality.
Thus, a system and method for joint optimization of a lexicon, segmentation algorithm and language model is needed that is unencumbered by the deficiencies and limitations commonly associated with prior-art language modeling techniques. Just such a solution is provided below.
Summary of the invention
The present invention relates to a system and iterative method for joint optimization of a lexicon, segmentation and language model. To overcome the limitations of the prior art, the present invention does not rely on a predetermined lexicon or segmentation algorithm; rather, the lexicon and segmentation are generated dynamically within an iterative process of optimizing the language model. According to one implementation, a method for improving language model performance is provided, comprising developing an initial language model from a lexicon and segmentation derived from a received text corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the text corpus according to statistical principles, until a threshold of predictive capability is achieved.
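The maximum match technique used to derive the initial segmentation can be sketched as a greedy longest-first scan. This is a hedged illustration only: the lexicon contents, the maximum word length, and the fallback to single items are assumptions not specified in the text.

```python
def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    string present in the lexicon, falling back to a single item
    when no lexicon entry matches."""
    i, words = 0, []
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i+n] in lexicon:
                words.append(text[i:i+n])
                i += n
                break
    return words
```

In the method summarized above, the output of such a pass would seed the initial language model, after which the lexicon and segmentation are refined statistically rather than fixed.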
Description of drawings
The same reference numbers are used throughout the drawings to reference like components and features.
Fig. 1 is a block diagram of a computer system incorporating the teachings of the present invention;
Fig. 2 is a block diagram of an example modeling agent that iteratively develops the lexicon, segmentation and language model according to the present invention;
Fig. 3 is a graphical representation of a DOMM tree according to one aspect of the present invention;
Fig. 4 is a flow chart of an example method for building a DOMM tree;
Fig. 5 is a flow chart of an example method for joint optimization of the lexicon, segmentation and language model according to the teachings of the present invention;
Fig. 6 is a flow chart detailing the method steps, according to one implementation of the present invention, for generating an initial lexicon and then iteratively and dynamically generating the lexicon, segmentation and language model until convergence; and
Fig. 7 illustrates a storage medium having stored thereon a plurality of executable instructions which, when executed, implement the innovative modeling agent of the present invention, according to an alternative embodiment.
Detailed description
The present invention relates to a system and iterative method for joint optimization of a lexicon, segmentation and language model. In describing the present invention, reference is made to an innovative language model, the dynamic order Markov model (DOMM). A detailed description of the DOMM is provided in co-pending U.S. Patent Application No. 09/XXXXXX, "A Method and Apparatus for Generating and Managing a Language Model Data Structure," by Lee et al., the disclosure of which is incorporated herein by reference.
In the discussion herein, the present invention is described in the general context of computer-executable instructions, such as program modules, being executed by one or more conventional computers. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, personal digital assistants, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. In a distributed computing environment, program modules may be located in both local and remote memory storage devices. It is to be noted, however, that modifications to the architecture and methods described herein may well be made without deviating from the spirit and scope of the present invention.
Example computer system
Fig. 1 illustrates an example computer system 102 including an innovative language modeling agent 104 which jointly optimizes the lexicon, segmentation and language model in accordance with the teachings of the present invention. It will be appreciated that, although depicted as a separate application in Fig. 1, language modeling agent 104 may well be implemented as a function of a higher-level application, e.g., a word processor, web browser, speech recognition system, or the like. Moreover, although depicted as a software application, those skilled in the art will appreciate that the innovative modeling agent may also be implemented in hardware, e.g., a programmable logic array (PLA), a special-purpose processor, an application-specific integrated circuit (ASIC), a microcontroller, and the like.
As will be evident from the description to follow, computer 102 is intended to represent any of a class of general- or special-purpose computing platforms which, when endowed with the innovative language modeling agent (LMA) 104, implement the teachings of the present invention in accordance with the first example implementation introduced above. It will be appreciated that, although LMA 104 is described herein as a software application, computer system 102 may alternatively support a hardware implementation of LMA 104. Accordingly, but for the description of LMA 104, the following description of computer system 102 is intended to be merely illustrative, as computer systems of greater or lesser capability may well be substituted without deviating from the spirit and scope of the present invention.
As shown, computer 102 includes one or more processors 132, a system memory 134, and a bus 136 that couples various system components, including the system memory 134, to processor 132.
Bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. The system memory includes read-only memory (ROM) 138 and random access memory (RAM) 140. A basic input/output system (BIOS) 142, containing the basic routines that help to transfer information between elements within computer 102, such as during start-up, is stored in ROM 138. Computer 102 further includes a hard disk drive 144 for reading from and writing to a hard disk (not shown), a magnetic disk drive 146 for reading from and writing to a removable magnetic disk 148, and an optical disk drive 150 for reading from and writing to a removable optical disk 152 such as a CD-ROM, DVD-ROM or other optical medium. The hard disk drive 144, magnetic disk drive 146 and optical disk drive 150 are connected to bus 136 by a SCSI interface 154 or some other suitable bus interface. These drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for computer 102.
Although the example environment described herein employs a hard disk 144, a removable magnetic disk 148 and a removable optical disk 152, those skilled in the art will appreciate that other types of computer-readable media which can store data accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAM), read-only memories (ROM), and the like, may also be used in the example operating environment.
A number of program modules may be stored on the hard disk 144, magnetic disk 148, optical disk 152, ROM 138 or RAM 140, including an operating system 158, one or more application programs 160 including the innovative LMA 104 incorporating the teachings of the present invention, other program modules 162, and program data 164 (e.g., the resulting language model data structure, etc.). A user may enter commands and information into computer 102 through input devices such as a keyboard 166 and a pointing device 168. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are connected to processor 132 through an interface 170 that is coupled to bus 136. A monitor 172 or other type of display device is also connected to bus 136 via an interface, such as a video adapter 174. In addition to the monitor 172, personal computers typically include other peripheral output devices (not shown) such as speakers and printers.
As shown, computer 102 operates in a networked environment using logical connections to one or more remote computers, such as a remote computer 176. Remote computer 176 may be another personal computer, a personal digital assistant, a server, a router or other network device, a network "thin-client" PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer 102, although only a memory storage device 178 is illustrated in Fig. 1.
The logical connections depicted in Fig. 1 include a local area network (LAN) 180 and a wide area network (WAN) 182. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. In one embodiment, remote computer 176 executes an Internet Web browser program, such as the "Internet Explorer" Web browser manufactured and distributed by Microsoft Corporation of Redmond, Washington, to access and utilize online services.
When used in a LAN networking environment, computer 102 is connected to the local network 180 through a network interface or adapter 184. When used in a WAN networking environment, computer 102 typically includes a modem 186 or other means for establishing communications over the wide area network 182, such as the Internet. The modem 186, which may be internal or external, is connected to bus 136 via an input/output (I/O) interface 156. In addition to network connectivity, I/O interface 156 also supports one or more printers 188. In a networked environment, program modules depicted relative to personal computer 102, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are illustrative, and other means of establishing a communications link between the computers may be used.
Generally, the data processors of computer 102 are programmed by means of instructions stored at different times in the various computer-readable storage media of the computer. Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs. From there, they are installed or loaded into the secondary memory of the computer. At execution, they are loaded at least partially into the computer's primary electronic memory. The invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing, in conjunction with a microprocessor or other data processor, the innovative steps described below. The invention also includes the computer itself when programmed according to the methods and techniques described below. Furthermore, certain subcomponents of the computer may be programmed to perform the functions and steps described below; the invention includes such subcomponents when they are programmed as described. In addition, the invention described herein includes data structures, described below, as embodied on various types of storage media.
For purposes of illustration, programs and other executable program components, such as the operating system, are depicted herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer, and are executed by the computer's data processor(s).
Example language modeling agent
Fig. 2 illustrates a block diagram of an example language modeling agent (LMA) 104 incorporating the teachings of the present invention. As shown, LMA 104 is comprised of one or more controllers 202, an innovative analysis engine 204, storage/memory 206 and, optionally, one or more additional helper applications 208 (e.g., a graphical user interface, prediction applications, verification applications, estimation applications, etc.), each communicatively coupled as shown. It will be appreciated that, although depicted in Fig. 2 as a number of disparate blocks, one or more of the functional elements of LMA 104 may well be combined. In this regard, modeling agents of greater or lesser complexity that iteratively and jointly optimize a dynamic lexicon, segmentation and language model may be employed without deviating from the spirit and scope of the present invention.
As alluded to above, although depicted as a separate functional element, LMA 104 may well be implemented as a function of a higher-level application, e.g., a word processor, web browser, speech recognition system, or language conversion system. In such an implementation, controller(s) 202 of LMA 104 are responsive to one or more directive commands from the parent application to selectively invoke the features of LMA 104. Alternatively, LMA 104 may well be implemented as a stand-alone language modeling tool, providing the user with a user interface (208) to selectively implement the features of LMA 104 described below.
In either case, controller(s) 202 of LMA 104 selectively invoke one or more functions of analysis engine 204 to optimize a language model based on a dynamically generated lexicon and segmentation algorithm. Thus, except as configured to effect the teachings of the present invention, controller 202 is intended to represent any of a number of alternate control systems known in the art, including but not limited to a microprocessor, a programmable logic array (PLA), a microcomputer, an application-specific integrated circuit (ASIC), and the like. In an alternate implementation, controller 202 is intended to represent a series of executable instructions to implement the control logic described above.
As shown, the innovative analysis engine 204 is comprised of a Markov probability calculator 212, a data structure generator 210 including a frequency calculation subroutine 213, a dynamic lexicon generation subroutine 214, a dynamic segmentation subroutine 216, and a data structure memory manager 218. Upon receiving an external indication, controller 202 selectively invokes an instance of analysis engine 204 to develop, modify and optimize a statistical language model (SLM). More particularly, and unlike prior-art language modeling techniques, analysis engine 204 generates the statistical language model data structure substantially from the Markov transition probabilities between individual items (e.g., characters, letters, numbers, etc.) of a textual corpus (e.g., one or more sets of text). Moreover, as will be shown, analysis engine 204 utilizes as much data (referred to as "context", or "order") as possible in calculating the probability of a string of items. In this regard, the language model of the present invention is aptly referred to as a dynamic order Markov model (DOMM).
When invoked by controller 202 to build the DOMM data structure, analysis engine 204 selectively invokes data structure generator 210. In response, data structure generator 210 builds a tree data structure comprised of a number of nodes (one associated with each of a number of items) and representing the dependencies between the nodes. As introduced above, the tree data structure is referred to herein as the DOMM data structure, or DOMM tree. Controller 202 receives a textual corpus and stores at least a subset of the corpus in memory 206 as a dynamic training set 222, from which the language model will be generated. It will be appreciated that, in alternate embodiments, a predetermined training set may also be used.
Upon receipt of the dynamic training set, frequency calculation subroutine 213 retrieves at least a subset of training set 222 for analysis. Frequency calculation subroutine 213 identifies the frequency of occurrence of each item (character, letter, number, word, etc.) within the training set subset. Based on the dependencies between nodes, data structure generator 210 assigns each item to an appropriate node of the DOMM tree, along with an indication of its frequency value (C_i) and a comparison bit (b_i).
Markov probability calculator 212 calculates the probability of an item (character, letter, number, etc.) based on the context (j) of associated items. More specifically, in accordance with the teachings of the present invention, the Markov probability of a particular item (C_i) depends on as many prior characters as the data "allows", i.e.:
P(C_1, C_2, C_3, ..., C_N) ≈ ∏_i P(C_i | C_{i-1}, C_{i-2}, C_{i-3}, ..., C_j)    (2)
The number of characters used by Markov probability calculator 212 as context (j) varies "dynamically" for each sequence of characters C_i, C_{i-1}, C_{i-2}, C_{i-3}, etc. According to one implementation, the number of context characters (j) used by Markov probability calculator 212 depends, at least in part, on the frequency values of the individual characters, i.e., the rate at which they occur throughout the text corpus. More specifically, if Markov probability calculator 212 does not identify at least a minimum frequency of occurrence for a particular item within the text corpus, the item may be pruned (i.e., eliminated) from the tree data structure as statistically irrelevant. According to one embodiment, the minimum frequency threshold is three (3).
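The dynamic-order idea — use as long a context as the counts support, pruning contexts below the minimum frequency threshold — can be sketched as follows. This is a simplified illustration under stated assumptions: the longest-first backoff rule and the maximum-likelihood estimator are not specified in the text; only the threshold of three is taken from it.

```python
from collections import defaultdict

MIN_COUNT = 3  # minimum frequency threshold from the text; rarer contexts are pruned

def count_ngrams(text, max_order):
    """Count all substrings of length 1..max_order+1 (item plus context)."""
    counts = defaultdict(int)
    for i in range(len(text)):
        for j in range(max_order + 1):
            if i - j < 0:
                break
            counts[text[i-j:i+1]] += 1
    return counts

def dynamic_prob(counts, history, c):
    """P(c | history) using the longest suffix of the history whose
    count clears MIN_COUNT; otherwise fall back to the unigram estimate."""
    for k in range(len(history), -1, -1):
        ctx = history[len(history) - k:]
        if counts.get(ctx, 0) >= MIN_COUNT and counts.get(ctx + c, 0) > 0:
            return counts[ctx + c] / counts[ctx]
    total = sum(v for s, v in counts.items() if len(s) == 1)
    return counts.get(c, 0) / total if total else 0.0
```

The effect is that frequent sequences are predicted from long contexts while rare ones degrade gracefully, rather than fixing the order at two preceding items as a tri-gram model does.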
As alluded to above, analysis engine 204 does not rely on a fixed lexicon or a simple segmentation algorithm, both of which are prone to error. Rather, analysis engine 204 selectively invokes dynamic segmentation subroutine 216 to group items (e.g., characters or letters) into strings (e.g., words). More specifically, segmentation subroutine 216 divides training set 222 into subsets (chunks) and calculates an intra-subset cohesion measure, i.e., a measure of the similarity among the items within the subset. Segmentation subroutine 216 iteratively performs the segmentation and cohesion calculations until the intra-subset cohesion of each subset reaches a predetermined threshold.
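One way to realize an intra-chunk cohesion measure is pointwise mutual information (PMI) over adjacent items. The sketch below is an assumption — the text does not name PMI — and shows a single greedy merge pass; in the iterative scheme described above, such passes would repeat until the cohesion values stabilize against the threshold.

```python
import math
from collections import Counter

def pmi(pair, uni, bi, total):
    """Pointwise mutual information of an adjacent item pair,
    a stand-in for the unspecified cohesion measure."""
    a, b = pair
    if bi[pair] == 0:
        return float("-inf")
    return math.log(bi[pair] * total / (uni[a] * uni[b]))

def merge_pass(tokens, threshold):
    """One pass: greedily fuse adjacent tokens whose PMI exceeds threshold."""
    total = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pmi((tokens[i], tokens[i+1]), uni, bi, total) > threshold:
            out.append(tokens[i] + tokens[i+1])  # cohesive pair becomes one string
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```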
Lexicon generation subroutine 214 is invoked to dynamically generate a lexicon 220 and store it in memory 206. According to one implementation, lexicon generation subroutine 214 analyzes the segmentation results and generates the lexicon from strings whose Markov transition probabilities exceed a threshold. In this regard, lexicon generation subroutine 214 develops the dynamic lexicon 220 from strings exceeding a predetermined Markov transition probability, derived from one or more language models generated by analysis engine 204. Consequently, unlike prior-art language models that depend on an error-prone, fixed lexicon, analysis engine 204 generates a lexicon of statistically significant, statistically accurate strings from one or more language models developed over time. According to one embodiment, the lexicon 220 comprises a "virtual corpus" that Markov probability calculator 212 relies on, in addition to the dynamic training set, in developing subsequent language models.
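A minimal sketch of admitting strings into the lexicon by transition probability follows. The threshold value, candidate set, and the brute-force substring counting are illustrative assumptions, not the patent's procedure.

```python
def extract_lexicon(raw_text, candidates, min_prob=0.5):
    """Admit a candidate string when the probability of its final item
    given its prefix, estimated from raw corpus counts, clears min_prob."""
    def occurrences(s):
        # overlapping substring count over the corpus
        return sum(1 for i in range(len(raw_text) - len(s) + 1)
                   if raw_text[i:i+len(s)] == s)
    lexicon = set()
    for w in candidates:
        prefix_hits = occurrences(w[:-1])
        if prefix_hits and occurrences(w) / prefix_hits >= min_prob:
            lexicon.add(w)
    return lexicon
```

Because the counts come from the model's own segmentation history rather than a hand-built word list, the admitted strings reflect the corpus statistics, which is the point of the "virtual corpus" described above.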
When invoked to modify or utilize the DOMM language model data structure, analysis engine 204 selectively invokes an instance of data structure memory manager 218. In accordance with one aspect of the present invention, data structure memory manager 218 utilizes both system memory and extended memory to store the DOMM data structure. More specifically, as described more fully below with reference to Figs. 6 and 7, data structure memory manager 218 provides improved performance characteristics by employing a WriteNode subroutine and a ReadNode subroutine (not shown) to store a most recently used subset of DOMM data structure nodes in a first-level cache 224 of system memory 206, while moving less recently used nodes to extended memory (e.g., a disk file on hard drive 144 or some remote drive). In addition, a second-level cache of system memory 206 is used to aggregate write commands until a predetermined threshold is reached, at which point the aggregated WriteNode commands are issued by the data structure memory manager to the appropriate location in memory. Although depicted as a separate functional element, those skilled in the art will appreciate that data structure memory manager 218 may well be combined as a functional element of controller 202 without deviating from the spirit and scope of the present invention.
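The two-level scheme — recently used nodes resident in memory, less recently used nodes spilled to disk via ReadNode/WriteNode operations — might be sketched as a least-recently-used write-back cache. The class and field names are assumptions, and the second-level write-aggregation cache described above is omitted for brevity.

```python
from collections import OrderedDict

class NodeCache:
    """LRU write-back cache for tree nodes: recently used nodes stay in
    memory; the least recently used node is spilled to backing storage
    when capacity is exceeded."""
    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.backing = backing  # e.g. a dict standing in for a disk file

    def read_node(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        return self.backing.get(key)     # fall through to extended memory

    def write_node(self, key, node):
        self.cache[key] = node
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            old_key, old_node = self.cache.popitem(last=False)
            self.backing[old_key] = old_node  # spill LRU node to disk
```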
Example data structure — dynamic order Markov model (DOMM) tree
Fig. 3 graphically illustrates an example dynamic order Markov model tree data structure 300 in accordance with the teachings of the present invention. To illustrate the principles on which DOMM tree data structure 300 is constructed, Fig. 3 presents an example DOMM data structure 300 for a language model formed from the English alphabet, i.e., A, B, C, ..., Z. As shown, DOMM tree 300 is comprised of one or more root nodes 302 and one or more dependent nodes 304, each associated with an item (character, letter, number, word, etc.) of a textual corpus, and joined by logical connections representing the dependencies between the nodes. According to one implementation of the present invention, a root node 302 is comprised of an item and a frequency value (e.g., a count of how many times the item appears in the textual corpus). At a level below the root node layer 302, the dependent nodes are arranged in binary sub-trees, wherein each node includes a comparison bit (b_i), the item associated with the node (A, B, ...), and the frequency value of that item (C_N).
Thus, beginning with the root node 306 associated with item B, a binary sub-tree comprised of dependent nodes 308-318 denotes the relationships between the nodes and their frequencies of occurrence. Given this illustrative example, it will be appreciated that the search complexity of the DOMM tree, beginning from a root node such as node 306, approaches log(N), where N is the total number of nodes to be searched.
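The binary sibling sub-tree and its near-log(N) lookup can be sketched as follows. The field names and the ordering by item value are assumptions for illustration; the structure described above keys on comparison bits (b_i) rather than direct item comparison.

```python
class DommNode:
    """One node: an item, its occurrence count, and binary sub-tree links
    among sibling nodes (a simplified stand-in for the DOMM tree node)."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.left = None
        self.right = None

def insert(node, item):
    """Binary insert into a sibling sub-tree, bumping the count on a hit;
    roughly log(N) comparisons on a balanced sub-tree."""
    if node is None:
        node = DommNode(item)
    if item == node.item:
        node.count += 1
    elif item < node.item:
        node.left = insert(node.left, item)
    else:
        node.right = insert(node.right, item)
    return node

def find(node, item):
    """Descend the binary sub-tree; returns the node or None."""
    while node is not None and node.item != item:
        node = node.left if item < node.item else node.right
    return node
```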
As alluded to above, the size of DOMM tree 300 may well exceed the space available in main memory 206 of LMA 104 and/or computer system 102. Accordingly, data structure memory manager 218 facilitates maintaining the DOMM tree data structure 300 across main memory (e.g., 140 and/or 260) and extended memory, e.g., a disk file on a mass storage device such as hard disk drive 144 of computer system 102.
Example operation and implementation
Having introduced the functional and conceptual elements of the present invention with reference to Figs. 1-3, the operation of the innovative language modeling agent 104 will now be described with reference to Figs. 5-10.
Building the DOMM tree data structure
Fig. 4 is a flow chart of an example method for building a dynamic order Markov model (DOMM), in accordance with one aspect of the present invention. As alluded to above, language modeling agent 104 may be invoked directly by a user or by a higher-level application. In response, controller 202 of LMA 104 selectively invokes an instance of analysis engine 204, and a textual corpus (e.g., one or more documents) is loaded into memory 206 as a dynamic training set 222 and split into subsets (e.g., sentences, clauses, etc.), block 402. In response, data structure generator 210 assigns each item of the subset to a node of the data structure and calculates a frequency value for the item, block 404. According to one implementation, once data structure generator has populated the data structure with the subset, frequency calculation subroutine 213 is invoked to determine the frequency of occurrence of each item in the training set subset.
In block 406, data structure generator determines whether additional subsets of the training set remain; if so, the next subset is read at block 408 and the process continues at block 404. In an alternative implementation, data structure generator 210 populates the data structure one subset at a time before invoking frequency calculation subroutine 213. In yet another alternative, the frequency calculation subroutine simply counts each item as it is placed into the associated node of the data structure.
If, in block 406, data structure generator 210 has added every item of training set 222 to data structure 300, data structure generator 210 selectively prunes the data structure, block 410. Any of a number of mechanisms may be employed to prune the resultant data structure 300.
Example method for lexicon, segmentation and language model joint optimization
Fig. 5 is a flow chart of an example method for the joint optimization of lexicon, segmentation and language model, in accordance with the teachings of the present invention. As shown, the method begins with block 400, wherein LMA 104 is invoked and a prefix tree is built from at least a subset of a received textual corpus. More particularly, as shown in Fig. 4, data structure generator 210 of modeling agent 104 analyzes the received textual corpus, selects at least a subset as a training set, and builds a DOMM tree from the training set.
In block 502, a very large lexicon is built from the prefix tree and pre-processed to remove certain obviously illogical words. More particularly, lexicon generation subroutine 214 is invoked to build an initial lexicon from the prefix tree. According to one implementation, the initial lexicon is built from all substrings of the prefix tree whose length is less than a certain predetermined value, e.g., ten (10) items (i.e., paths from the root node to a subordinate node spanning ten nodes or fewer). Once the initial lexicon has been compiled, lexicon generation subroutine 214 reduces its size by deleting certain obviously illogical words (see, e.g., block 604 below). According to one implementation, lexicon generation subroutine 214 appends the newly generated initial lexicon, trained from at least the received textual corpus, to a predetermined lexicon.
In block 504, at least the training set of the received textual corpus is segmented using the initial lexicon. More particularly, dynamic segmentation subroutine 216 is invoked to segment at least the training set of the received textual corpus, producing an initially segmented textual corpus. Those skilled in the art will appreciate that any of a number of methods may be used to segment the training corpus, e.g., fixed-length segmentation, maximum matching, and the like. Since no statistical language model (SLM) has yet been generated from the received textual corpus, dynamic segmentation subroutine 216 utilizes the maximum matching technique to provide the initially segmented textual corpus. Accordingly, segmentation subroutine 216 starts at the beginning of a string (or a branch of the DOMM tree) and searches the lexicon to check whether the initial item (I_1) is a one-item "word". The segmentation subroutine then combines that item with the next item in the string and searches the lexicon to determine whether the combination (e.g., I_1 I_2) appears as a "word", and so on. According to one implementation, the longest string of items (I_1, I_2, ..., I_N) found in the lexicon is taken as the correct segmentation of the string. It will be appreciated that segmentation subroutine 216 may well utilize more sophisticated maximum matching algorithms without deviating from the spirit and scope of the present invention.
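The forward maximum-matching pass just described can be sketched as follows. This is a simplified illustration with hypothetical names; the patent's subroutine operates over lexicon entries and DOMM-tree branches rather than Python strings, and the ten-item limit mirrors the substring bound cited above.

```python
def max_match(text, lexicon, max_len=10):
    """Greedy forward maximum matching: at each position take the
    longest substring found in the lexicon; an unmatched single item
    falls back to a one-item 'word'."""
    segments, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one item.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                segments.append(text[i:j])
                i = j
                break
    return segments

# Usage: "ABC" is in the lexicon, so it wins over "AB"; "D" falls back
# to a one-item word.
lexicon = {"AB", "ABC", "CD"}
```

More sophisticated variants (e.g., backward matching, or resolving ties by word count) fit the same skeleton.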
Having formed an initial lexicon and segmentation from the training textual corpus, an iterative process is entered wherein the lexicon, segmentation and language model are jointly optimized, block 506. More particularly, as will be described more fully below, the innovative iterative optimization employs statistical language modeling to dynamically adjust the segmentation and the lexicon, thereby providing an optimized language model. That is, unlike prior language modeling techniques, modeling agent 104 does not rely on a predetermined, static lexicon or on an overly simplistic segmentation algorithm to generate the language model. Rather, modeling agent 104 dynamically generates the lexicon and segmentation from the received textual corpus, or from at least a subset (training set) thereof, to produce an optimized language model. In this regard, the language model generated by modeling agent 104 does not suffer from the deficiencies and limitations commonly associated with prior modeling techniques.
Having introduced the innovative process with respect to Fig. 5, Fig. 6 provides a more detailed flow chart of generating the initial lexicon, and of the iterative process of refining the lexicon and segmentation to optimize the language model, according to one implementation of the present invention. As before, the method begins with step 400 (Fig. 4) of building a prefix tree from the received textual corpus. As noted above, the prefix tree may be built from the entire textual corpus, or from a subset thereof (referred to as the training corpus).
Within block 502, the process of generating the initial lexicon begins with block 602, wherein lexicon generation subroutine 214 generates an initial lexicon from the prefix tree by identifying substrings (or branches of the prefix tree) of less than a predetermined number of items. According to one implementation, lexicon generation subroutine 214 identifies substrings of ten (10) items or fewer to populate the initial lexicon. In block 604, lexicon generation subroutine 214 analyzes the initial lexicon generated in step 602 for obviously illogical substrings and removes them from the initial lexicon. That is, lexicon generation subroutine 214 analyzes the initial lexicon substrings for illogical or improbable words, and removes such words from the lexicon. For this initial pruning, dynamic segmentation subroutine 216 is invoked to segment at least the training set of the received textual corpus, producing a segmented corpus. According to one implementation, a maximum matching algorithm is used to perform the segmentation according to the initial lexicon. Frequency calculation subroutine 213 is then invoked to calculate the frequency of occurrence within the received textual corpus of each word in the lexicon, and the lexicon is sorted by frequency of occurrence. The lowest-frequency words are identified and deleted from the lexicon. The thresholds for deletion and re-segmentation may be determined by the size of the corpus. According to one implementation, for a corpus of 600M items, a frequency threshold of 500 may be used for inclusion in the lexicon. In this way, most obviously illogical words are removed from the initial lexicon.
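The frequency-threshold pruning step can be sketched as below. The function and argument names are hypothetical; the 500 threshold is the value the text cites for a 600M-item corpus, and the small threshold in the usage comment is for illustration only.

```python
from collections import Counter

def prune_lexicon(lexicon, segmented_corpus, threshold=500):
    """Count how often each lexicon word occurs in the segmented
    corpus, then keep only words at or above the frequency threshold.
    segmented_corpus is an iterable of word lists (one per sentence)."""
    counts = Counter(word for sentence in segmented_corpus
                     for word in sentence)
    return {word for word in lexicon if counts[word] >= threshold}

# Usage (threshold lowered to 2 for this tiny corpus): "AB" occurs
# twice and survives; "C" and "D" occur once each and are pruned.
corpus = [["AB", "C"], ["AB", "D"]]
kept = prune_lexicon({"AB", "C", "D"}, corpus, threshold=2)
```

In the full method, each pruned word's occurrences would then be re-segmented into smaller words and the counts recomputed, as described for block 610 below.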
Once the initial lexicon has been generated and pruned in step 502, at least part of the received textual corpus is segmented according to the initial lexicon, block 504. As noted above, according to one implementation, the initial segmentation of the textual corpus is performed using the maximum matching method.
Once the initial lexicon and corpus segmentation are complete, the iterative process of dynamically modifying the lexicon and segmentation to optimize the statistical language model (SLM) from the received textual corpus (or training set) begins, block 506. As shown, the procedure starts with block 606, wherein Markov probability calculator 212 begins language model training on the segmented textual corpus using the initial lexicon and segmentation. That is, given the initial lexicon and initial segmentation, a statistical language model can be generated from them. It should be noted that although this language model does not yet benefit from a refined lexicon or from statistics-based segmentation (which are developed in the steps that follow), it is nonetheless fundamentally grounded in the received textual corpus; thus, even the initial language model is serviceable.
In block 608, after the initial language model training, the segmented textual corpus (or training set) is re-segmented using SLM-based segmentation. Given a sentence w1, w2, ..., wn, there are M possible ways to segment it (M >= 1). Dynamic segmentation subroutine 216 calculates the probability (p_i) of each segmentation (S_i) according to the N-gram statistical language model. According to one implementation, segmentation subroutine 216 utilizes a tri-gram (i.e., N=3) statistical language model to determine the probability of any given segmentation. A Viterbi search algorithm is employed to identify the most probable segmentation S_k, where:
S_k = arg max_i (p_i)    (3)
In block 610, the SLM-based lexicon is updated using the re-segmented textual corpus obtained from the foregoing segmentation. According to one implementation, modeling agent 104 invokes frequency calculation subroutine 213 to calculate the frequency of occurrence within the received textual corpus of each word in the lexicon, and sorts the lexicon by frequency of occurrence. The lowest-frequency word is identified and deleted from the lexicon; each occurrence of that word must then be re-segmented into smaller words, and the counts of all affected words recalculated. The thresholds for deletion and re-segmentation may be determined by the size of the corpus. According to one implementation, for a corpus of 600M items, a frequency threshold of 500 may be used for inclusion in the lexicon.
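The Viterbi-style search for the most probable segmentation, S_k = arg max_i (p_i), can be sketched with dynamic programming over word-end positions. This is a simplified illustration: for brevity it scores words with a unigram log-probability table, whereas the text describes a tri-gram model; the function and table names are hypothetical.

```python
import math

def best_segmentation(text, word_logprob, max_len=10):
    """Dynamic-programming (Viterbi-style) search for the highest
    log-probability segmentation of text. word_logprob maps a word
    to its log probability under the current SLM."""
    n = len(text)
    # best[j] = (best log-prob of text[:j], start index of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = text[i:j]
            if word in word_logprob and best[i][0] > -math.inf:
                score = best[i][0] + word_logprob[word]
                if score > best[j][0]:
                    best[j] = (score, i)
    # Backtrack from the end of the string to recover the words.
    segments, j = [], n
    while j > 0:
        i = best[j][1]
        segments.append(text[i:j])
        j = i
    return list(reversed(segments))

# Usage: P("AB") = 0.5 beats P("A")*P("B") = 0.01, so "AB" stays whole.
probs = {"A": math.log(0.1), "B": math.log(0.1), "AB": math.log(0.5)}
```

A tri-gram variant would condition each word's score on the two preceding words in the candidate segmentation, as claim 7 suggests, but the lattice-and-backtrack structure is the same.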
In block 612, the language model is updated to reflect the dynamically generated lexicon and the SLM-based segmentation, and Markov probability calculator 212 computes a perplexity measure of the language model (i.e., an inverse probability measure). If the perplexity continues to converge (decrease), i.e., improve, the procedure continues at block 608, wherein the lexicon and segmentation are again modified with the intent of further improving language model performance (as measured by perplexity). If, in block 614, it is determined that the most recent modification of the lexicon and segmentation did not improve the language model, it is further determined at block 616 whether the perplexity has reached an acceptable threshold. If so, the procedure ends.
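The perplexity measure used as the stopping criterion can be sketched as the exponential of the negative mean log-probability over the corpus, the standard inverse-probability formulation (the text does not give an explicit formula, so this is an assumed definition; lower is better, and the loop continues while it keeps decreasing).

```python
import math

def perplexity(word_logprobs):
    """Perplexity of a corpus given per-word log probabilities under
    the current language model: exp(-mean log P). A lower value means
    the model predicts the corpus better."""
    return math.exp(-sum(word_logprobs) / len(word_logprobs))

# Usage: if every word has probability 0.25, perplexity is exactly 4,
# i.e. the model is as uncertain as a uniform choice among 4 words.
uniform4 = [math.log(0.25)] * 100
```

In the iteration of blocks 608-616, this value would be recomputed after each lexicon update and re-segmentation, and the loop exits once it stops improving or falls below the acceptable threshold.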
If, however, the language model has not yet reached an acceptable perplexity threshold, then at block 608 lexicon generation subroutine 214 deletes from the lexicon the word with the lowest frequency of occurrence in the corpus, that word is re-segmented into smaller words at block 618, and the procedure proceeds to block 610.
From the foregoing description, it will be appreciated that, premised on a lexicon and segmentation rules dynamically generated, at least statistically, from a subset of the received corpus, the innovative language modeling agent 104 produces an optimized language model. In this regard, the resultant language model has improved computational and predictive attributes over prior language models.
Alternative embodiments
Fig. 7 is a block diagram of a storage medium having stored thereon a plurality of instructions, including instructions to implement the innovative modeling agent of the present invention, according to yet another embodiment of the present invention. In general, Fig. 7 illustrates a storage medium/device 700 having stored thereon a plurality of executable instructions 702, including at least a subset which, when executed, implement the innovative modeling agent 104 of the present invention. When executed by a processor of a host system, executable instructions 702 implement the modeling agent to generate a statistical language model representation of a textual corpus for use by any of a host of other applications executing on, or otherwise available to, the host system.
As used herein, storage medium 700 is intended to represent any of a number of storage devices and/or storage media known to those skilled in the art, e.g., volatile memory devices, non-volatile memory devices, magnetic storage media, optical storage media, and the like. Similarly, the executable instructions are intended to reflect any of a number of software languages known in the art, e.g., C++, Visual Basic, Hypertext Markup Language (HTML), Java, eXtensible Markup Language (XML), and the like. Moreover, it is to be appreciated that storage medium/device 700 need not be co-located with any host system. That is, storage medium/device 700 may well reside within a remote server communicatively coupled to, and accessible by, an executing system. Accordingly, the software implementation of Fig. 7 is to be regarded as illustrative, as alternative storage media and software embodiments are anticipated within the spirit and scope of the present invention.
Although the invention has been described in language specific to structural features and/or methodological steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described. Rather, the specific features and steps are disclosed as exemplary forms of implementing the claimed invention.

Claims (32)

1. A method comprising:
forming an initial language model from a lexicon and a segmentation derived from a received corpus; and
iteratively refining the initial language model by statistically and dynamically updating the lexicon and re-segmenting the corpus, until a predictive capability threshold is reached.
2. The method of claim 1, wherein forming the initial language model comprises:
generating a prefix tree data structure from items parsed from the received corpus;
identifying substrings of N items or fewer from the prefix tree data structure; and
populating said lexicon with the identified substrings.
3. The method of claim 2, wherein N equals 3.
4. The method of claim 1, wherein iteratively refining the initial language model comprises:
re-segmenting said corpus by determining a probability of occurrence for each segmentation.
5. The method of claim 4, wherein an N-gram language model is utilized to calculate the determined probability of occurrence of a segmentation.
6. The method of claim 5, wherein the N-gram language model is a 3-gram language model.
7. The method of claim 4, wherein two preceding segmentations are utilized to calculate the determined probability of occurrence of a segmentation.
8. The method of claim 4, wherein iteratively refining the language model comprises:
updating the lexicon according to the re-segmented corpus.
9. The method of claim 8, wherein updating the lexicon comprises:
determining a frequency of occurrence within the received corpus of each word in the lexicon; and
deleting from the lexicon the word with the lowest determined frequency.
10. The method of claim 9, further comprising:
re-segmenting the deleted word into two or more smaller words, and updating the lexicon with the re-segmented words.
11. The method of claim 8, further comprising:
calculating a predictive measure of the language model utilizing the updated lexicon and the re-segmented corpus.
12. The method of claim 11, wherein the predictive measure is a language model perplexity.
13. The method of claim 11, further comprising:
determining whether the predictive capability of the language model improved as a result of the updating and re-segmenting; and
if the predictive capability improved, performing additional updating and re-segmenting until no further improvement is determined.
14. The method of claim 1, wherein the initial language model is obtained utilizing a maximum matching technique.
15. The method of claim 1, wherein the predictive capability is quantified and expressed as a perplexity measure.
16. The method of claim 15, wherein the language model is refined until the perplexity measure is reduced below an acceptable predictive threshold.
17. The method of claim 1, further comprising:
utilizing the iteratively refined language model in an application to predict a likelihood of another corpus.
18. The method of claim 17, wherein said application is one or more of a spelling and/or grammar checker, a word processing application, a language translation application, a speech recognition application, and the like.
19. A storage medium comprising a plurality of executable instructions, including at least a subset of instructions which, when executed, implement the method of claim 1.
20. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and
an execution unit, coupled with said storage device, to execute at least a subset of said plurality of executable instructions to implement the method of claim 1.
21. A storage medium comprising a plurality of executable instructions, including at least a subset of instructions which, when executed, implement a language modeling agent, the language modeling agent comprising a subroutine to establish an initial language model from a lexicon and a segmentation derived from a received corpus, and a subroutine to iteratively refine the initial language model by statistically and dynamically updating the lexicon and re-segmenting the corpus until a predictive capability threshold is reached.
22. The storage medium of claim 21, wherein the language modeling agent quantifies the determined predictive capability utilizing a perplexity measure.
23. The storage medium of claim 21, wherein the language modeling agent utilizes a maximum matching technique to derive the lexicon and segmentation from the received corpus.
24. The storage medium of claim 21, wherein the subroutine to establish the initial language model generates a prefix tree data structure from items parsed from the received corpus, identifies substrings of N items or fewer from the prefix tree, and populates the lexicon with the identified substrings.
25. The storage medium of claim 21, wherein the subroutine iteratively refines the initial language model by determining a frequency of occurrence of each segmentation and re-segmenting the corpus to reflect improved segmentation probabilities.
26. The storage medium of claim 25, wherein the language modeling agent utilizes a hidden Markov probability measure to determine the probability of occurrence of each segmentation.
27. The storage medium of claim 19, further comprising instructions which, when executed, implement an application utilizing the language model established by the language modeling agent.
28. A system comprising:
a storage media drive to removably receive the storage medium of claim 19; and
an execution unit, coupled with said storage media drive, to access and execute at least a subset of the plurality of executable instructions resident on the removably received storage medium to implement a language modeling agent.
29. A modeling agent comprising:
a statistical calculator to determine a likelihood of a corpus segmentation; and
a data structure generator to establish an initial language model from a lexicon and a segmentation dynamically derived from a received corpus, and to iteratively refine the language model until the likelihood of the corpus segmentation reaches an acceptable threshold.
30. The modeling agent of claim 29, wherein the statistical calculator utilizes Markov modeling techniques to determine the likelihood of the corpus segmentation.
31. The modeling agent of claim 29, wherein the data structure generator generates a prefix tree data structure from items parsed from the received corpus, identifies substrings of N items or fewer from the prefix tree, and populates the lexicon with the identified substrings.
32. The modeling agent of claim 31, wherein the statistical calculator determines the likelihood of the identified substrings, and wherein the modeling agent re-segments the corpus in an attempt to improve the substring likelihood.
CNB008152942A 1999-11-05 2000-11-03 System and iterative method for lexicon, segmentation and language model joint optimization Expired - Fee Related CN100430929C (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16385099P 1999-11-05 1999-11-05
US60/163,850 1999-11-05
US09/609,202 2000-06-30
US09/609,202 US6904402B1 (en) 1999-11-05 2000-06-30 System and iterative method for lexicon, segmentation and language model joint optimization

Publications (2)

Publication Number Publication Date
CN1387651A true CN1387651A (en) 2002-12-25
CN100430929C CN100430929C (en) 2008-11-05

Family

ID=26860000

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB008152942A Expired - Fee Related CN100430929C (en) 1999-11-05 2000-11-03 System and iterative method for lexicon, segmentation and language model joint optimization

Country Status (5)

Country Link
US (2) US6904402B1 (en)
JP (1) JP2003523559A (en)
CN (1) CN100430929C (en)
AU (1) AU4610401A (en)
WO (1) WO2001037128A2 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100380370C (en) * 2003-04-30 2008-04-09 罗伯特·博世有限公司 Method for statistical language modeling in speech recognition
CN101266599B (en) * 2005-01-31 2010-07-21 日电(中国)有限公司 Input method and user terminal
CN1916889B (en) * 2005-08-19 2011-02-02 株式会社日立制作所 Language material storage preparation device and its method
CN101097488B (en) * 2006-06-30 2011-05-04 2012244安大略公司 Method for learning character fragments from received text and relevant hand-hold electronic equipments
CN103034628A (en) * 2011-10-27 2013-04-10 微软公司 Functionality for normalizing linguistic items
CN103201707A (en) * 2010-09-29 2013-07-10 触摸式有限公司 System and method for inputting text into electronic devices
CN105159890A (en) * 2014-06-06 2015-12-16 谷歌公司 Generating representations of input sequences using neural networks
CN105786796A (en) * 2008-04-16 2016-07-20 谷歌公司 Segmenting words using scaled probabilities
CN107111609A (en) * 2014-12-12 2017-08-29 全方位人工智能股份有限公司 Lexical analyzer for neural language performance identifying system
CN107427732A (en) * 2016-12-09 2017-12-01 香港应用科技研究院有限公司 For the system and method for the data structure for organizing and handling feature based
US10613746B2 (en) 2012-01-16 2020-04-07 Touchtype Ltd. System and method for inputting text
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium

Families Citing this family (101)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7750891B2 (en) 2003-04-09 2010-07-06 Tegic Communications, Inc. Selective input system based on tracking of motion parameters of an input device
US7821503B2 (en) * 2003-04-09 2010-10-26 Tegic Communications, Inc. Touch screen and graphical user interface
WO2000074240A1 (en) * 1999-05-27 2000-12-07 America Online Keyboard system with automatic correction
US7030863B2 (en) 2000-05-26 2006-04-18 America Online, Incorporated Virtual keyboard system with automatic correction
US7286115B2 (en) 2000-05-26 2007-10-23 Tegic Communications, Inc. Directional input system with automatic correction
US20050044148A1 (en) * 2000-06-29 2005-02-24 Microsoft Corporation Method and system for accessing multiple types of electronic content
US7020587B1 (en) * 2000-06-30 2006-03-28 Microsoft Corporation Method and apparatus for generating and managing a language model data structure
CN1226717C (en) * 2000-08-30 2005-11-09 国际商业机器公司 Automatic new term fetch method and system
DE60029456T2 (en) * 2000-12-11 2007-07-12 Sony Deutschland Gmbh Method for online adjustment of pronunciation dictionaries
WO2002097663A1 (en) * 2001-05-31 2002-12-05 University Of Southern California Integer programming decoder for machine translation
US8214196B2 (en) 2001-07-03 2012-07-03 University Of Southern California Syntax-based statistical translation model
US7493258B2 (en) * 2001-07-03 2009-02-17 Intel Corporation Method and apparatus for dynamic beam control in Viterbi search
JP2003036088A (en) * 2001-07-23 2003-02-07 Canon Inc Dictionary managing apparatus for voice conversion
AU2003269808A1 (en) * 2002-03-26 2004-01-06 University Of Southern California Constructing a translation lexicon from comparable, non-parallel corpora
US7269548B2 (en) * 2002-07-03 2007-09-11 Research In Motion Ltd System and method of creating and using compact linguistic data
CA2523992C (en) * 2003-05-28 2012-07-17 Loquendo S.P.A. Automatic segmentation of texts comprising chunks without separators
US7711545B2 (en) * 2003-07-02 2010-05-04 Language Weaver, Inc. Empirical methods for splitting compound words with application to machine translation
US8548794B2 (en) 2003-07-02 2013-10-01 University Of Southern California Statistical noun phrase translation
US7941310B2 (en) * 2003-09-09 2011-05-10 International Business Machines Corporation System and method for determining affixes of words
US7698125B2 (en) * 2004-03-15 2010-04-13 Language Weaver, Inc. Training tree transducers for probabilistic operations
US8296127B2 (en) * 2004-03-23 2012-10-23 University Of Southern California Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
US8666725B2 (en) 2004-04-16 2014-03-04 University Of Southern California Selection and use of nonstatistical translation components in a statistical machine translation framework
DE112005002534T5 (en) 2004-10-12 2007-11-08 University Of Southern California, Los Angeles Training for a text-to-text application that uses a string-tree transformation for training and decoding
PT1666074E (en) 2004-11-26 2008-08-22 Ba Ro Gmbh & Co Kg Disinfection lamp
CN100530171C (en) * 2005-01-31 2009-08-19 日电(中国)有限公司 Dictionary learning method and devcie
CN101124579A (en) * 2005-02-24 2008-02-13 富士施乐株式会社 Word translation device, translation method, and translation program
US7996219B2 (en) * 2005-03-21 2011-08-09 At&T Intellectual Property Ii, L.P. Apparatus and method for model adaptation for spoken language understanding
US8676563B2 (en) 2009-10-01 2014-03-18 Language Weaver, Inc. Providing human-generated and machine-generated trusted translations
US8886517B2 (en) 2005-06-17 2014-11-11 Language Weaver, Inc. Trust scoring for language translation systems
US7974833B2 (en) 2005-06-21 2011-07-05 Language Weaver, Inc. Weighted system of expressing language information using a compact notation
US7389222B1 (en) 2005-08-02 2008-06-17 Language Weaver, Inc. Task parallelization in a text-to-text system
US7813918B2 (en) * 2005-08-03 2010-10-12 Language Weaver, Inc. Identifying documents which form translated pairs, within a document collection
US7624020B2 (en) * 2005-09-09 2009-11-24 Language Weaver, Inc. Adapter for allowing both online and offline training of a text to text system
US20070078644A1 (en) * 2005-09-30 2007-04-05 Microsoft Corporation Detecting segmentation errors in an annotated corpus
US7328199B2 (en) * 2005-10-07 2008-02-05 Microsoft Corporation Componentized slot-filling architecture
US7941418B2 (en) * 2005-11-09 2011-05-10 Microsoft Corporation Dynamic corpus generation
US7606700B2 (en) * 2005-11-09 2009-10-20 Microsoft Corporation Adaptive task framework
US7822699B2 (en) * 2005-11-30 2010-10-26 Microsoft Corporation Adaptive semantic reasoning engine
US20070106496A1 (en) * 2005-11-09 2007-05-10 Microsoft Corporation Adaptive task framework
US10319252B2 (en) 2005-11-09 2019-06-11 Sdl Inc. Language capability assessment and training apparatus and techniques
US20070130134A1 (en) * 2005-12-05 2007-06-07 Microsoft Corporation Natural-language enabling arbitrary web forms
US7933914B2 (en) 2005-12-05 2011-04-26 Microsoft Corporation Automatic task creation and execution using browser helper objects
US7831585B2 (en) * 2005-12-05 2010-11-09 Microsoft Corporation Employment of task framework for advertising
US7835911B2 (en) * 2005-12-30 2010-11-16 Nuance Communications, Inc. Method and system for automatically building natural language understanding models
US20090006092A1 (en) * 2006-01-23 2009-01-01 Nec Corporation Speech Recognition Language Model Making System, Method, and Program, and Speech Recognition System
EP2511833B1 (en) * 2006-02-17 2020-02-05 Google LLC Encoding and adaptive, scalable accessing of distributed translation models
US20070203869A1 (en) * 2006-02-28 2007-08-30 Microsoft Corporation Adaptive semantic platform architecture
US7996783B2 (en) * 2006-03-02 2011-08-09 Microsoft Corporation Widget searching utilizing task framework
US8943080B2 (en) 2006-04-07 2015-01-27 University Of Southern California Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
US20070271087A1 (en) * 2006-05-18 2007-11-22 Microsoft Corporation Language-independent language model using character classes
US7558725B2 (en) * 2006-05-23 2009-07-07 Lexisnexis, A Division Of Reed Elsevier Inc. Method and apparatus for multilingual spelling corrections
WO2007142102A1 (en) * 2006-05-31 2007-12-13 Nec Corporation Language model learning system, language model learning method, and language model learning program
US8886518B1 (en) 2006-08-07 2014-11-11 Language Weaver, Inc. System and method for capitalizing machine translated text
US8433556B2 (en) 2006-11-02 2013-04-30 University Of Southern California Semi-supervised training for statistical word alignment
US9122674B1 (en) 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
US8468149B1 (en) 2007-01-26 2013-06-18 Language Weaver, Inc. Multi-lingual online community
US8225203B2 (en) 2007-02-01 2012-07-17 Nuance Communications, Inc. Spell-check for a keyboard system with automatic correction
US8201087B2 (en) 2007-02-01 2012-06-12 Tegic Communications, Inc. Spell-check for a keyboard system with automatic correction
US9465791B2 (en) * 2007-02-09 2016-10-11 International Business Machines Corporation Method and apparatus for automatic detection of spelling errors in one or more documents
US8615389B1 (en) 2007-03-16 2013-12-24 Language Weaver, Inc. Generation and exploitation of an approximate language model
US8831928B2 (en) 2007-04-04 2014-09-09 Language Weaver, Inc. Customizable machine translation service
US8825466B1 (en) 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
US8010341B2 (en) * 2007-09-13 2011-08-30 Microsoft Corporation Adding prototype information into probabilistic models
US8521516B2 (en) * 2008-03-26 2013-08-27 Google Inc. Linguistic key normalization
US8353008B2 (en) * 2008-05-19 2013-01-08 Yahoo! Inc. Authentication detection
US9411800B2 (en) * 2008-06-27 2016-08-09 Microsoft Technology Licensing, Llc Adaptive generation of out-of-dictionary personalized long words
US8301437B2 (en) * 2008-07-24 2012-10-30 Yahoo! Inc. Tokenization platform
US8462123B1 (en) * 2008-10-21 2013-06-11 Google Inc. Constrained keyboard organization
CN101430680B (en) 2008-12-31 2011-01-19 阿里巴巴集团控股有限公司 Segmentation sequence selection method and system for non-word boundary marking language text
US8326599B2 (en) * 2009-04-21 2012-12-04 Xerox Corporation Bi-phrase filtering for statistical machine translation
US8990064B2 (en) 2009-07-28 2015-03-24 Language Weaver, Inc. Translating documents based on content
US8380486B2 (en) 2009-10-01 2013-02-19 Language Weaver, Inc. Providing machine-generated translations and corresponding trust levels
US10417646B2 (en) 2010-03-09 2019-09-17 Sdl Inc. Predicting the cost associated with translating textual content
US11003838B2 (en) 2011-04-18 2021-05-11 Sdl Inc. Systems and methods for monitoring post translation editing
US8972260B2 (en) * 2011-04-20 2015-03-03 Robert Bosch Gmbh Speech recognition using multiple language models
US8694303B2 (en) 2011-06-15 2014-04-08 Language Weaver, Inc. Systems and methods for tuning parameters in statistical machine translation
US8886515B2 (en) 2011-10-19 2014-11-11 Language Weaver, Inc. Systems and methods for enhancing machine translation post edit review processes
US8942973B2 (en) 2012-03-09 2015-01-27 Language Weaver, Inc. Content page URL translation
US10261994B2 (en) 2012-05-25 2019-04-16 Sdl Inc. Method and system for automatic management of reputation of translators
CN102799676B (en) * 2012-07-18 2015-02-18 上海语天信息技术有限公司 Recursive and multilevel Chinese word segmentation method
US9152622B2 (en) 2012-11-26 2015-10-06 Language Weaver, Inc. Personalized machine translation via online adaptation
CN103871404B (en) * 2012-12-13 2017-04-12 北京百度网讯科技有限公司 Language model training method, query method and corresponding device
IL224482B (en) * 2013-01-29 2018-08-30 Verint Systems Ltd System and method for keyword spotting using representative dictionary
US9396723B2 (en) * 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN104217717B (en) * 2013-05-29 2016-11-23 Tencent Technology (Shenzhen) Company Limited Method and device for building a language model
US9396724B2 (en) 2013-05-29 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
US9213694B2 (en) 2013-10-10 2015-12-15 Language Weaver, Inc. Efficient online domain adaptation
US9972311B2 (en) * 2014-05-07 2018-05-15 Microsoft Technology Licensing, Llc Language model optimization for in-domain application
US9953646B2 (en) 2014-09-02 2018-04-24 Belleau Technologies Method and system for dynamic speech recognition and tracking of prewritten script
US10409910B2 (en) * 2014-12-12 2019-09-10 Omni Ai, Inc. Perceptual associative memory for a neuro-linguistic behavior recognition system
US9734826B2 (en) 2015-03-11 2017-08-15 Microsoft Technology Licensing, Llc Token-level interpolation for class-based language models
KR101668725B1 (en) * 2015-03-18 2016-10-24 Sungkyunkwan University Research & Business Foundation Latent keyphrase generation method and apparatus
IL242218B (en) 2015-10-22 2020-11-30 Verint Systems Ltd System and method for maintaining a dynamic dictionary
IL242219B (en) 2015-10-22 2020-11-30 Verint Systems Ltd System and method for keyword searching using both static and dynamic dictionaries
CN109408794A (en) * 2017-08-17 2019-03-01 Alibaba Group Holding Limited Method for building a frequency dictionary, word segmentation method, server, and client device
US10607604B2 (en) * 2017-10-27 2020-03-31 International Business Machines Corporation Method for re-aligning corpus and improving the consistency
CN110162681B (en) * 2018-10-08 2023-04-18 腾讯科技(深圳)有限公司 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
US11893983B2 (en) * 2021-06-23 2024-02-06 International Business Machines Corporation Adding words to a prefix tree for improving speech recognition
CN113468308B (en) * 2021-06-30 2023-02-10 竹间智能科技(上海)有限公司 Conversation behavior classification method and device and electronic equipment

Family Cites Families (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4689768A (en) * 1982-06-30 1987-08-25 International Business Machines Corporation Spelling verification system with immediate operator alerts to non-matches between inputted words and words stored in plural dictionary memories
US4899148A (en) * 1987-02-25 1990-02-06 Oki Electric Industry Co., Ltd. Data compression method
US6231938B1 (en) * 1993-07-02 2001-05-15 Watkins Manufacturing Corporation Extruded multilayer polymeric shell having textured and marbled surface
US5621859A (en) * 1994-01-19 1997-04-15 Bbn Corporation Single tree method for grammar directed, very large vocabulary speech recognizer
US5926388A (en) * 1994-12-09 1999-07-20 Kimbrough; Thomas C. System and method for producing a three dimensional relief
US5806021A (en) * 1995-10-30 1998-09-08 International Business Machines Corporation Automatic segmentation of continuous text using statistical approaches
JP3277792B2 (en) * 1996-01-31 2002-04-22 株式会社日立製作所 Data compression method and apparatus
FR2744817B1 (en) * 1996-02-08 1998-04-03 Ela Medical Sa ACTIVE IMPLANTABLE MEDICAL DEVICE AND ITS EXTERNAL PROGRAMMER WITH AUTOMATIC SOFTWARE UPDATE
US5822729A (en) * 1996-06-05 1998-10-13 Massachusetts Institute Of Technology Feature-based speech recognizer having probabilistic linguistic processor providing word matching based on the entire space of feature vectors
US5963893A (en) * 1996-06-28 1999-10-05 Microsoft Corporation Identification of words in Japanese text by a computer system
SE516189C2 (en) * 1996-07-03 2001-11-26 Ericsson Telefon Ab L M Method and apparatus for activating a user menu in a presentation means
US5905972A (en) * 1996-09-30 1999-05-18 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
US6424722B1 (en) * 1997-01-13 2002-07-23 Micro Ear Technology, Inc. Portable system for programming hearing aids
US6449662B1 (en) * 1997-01-13 2002-09-10 Micro Ear Technology, Inc. System for programming hearing aids
DE19708183A1 (en) * 1997-02-28 1998-09-03 Philips Patentverwaltung Method for speech recognition with language model adaptation
US6684063B2 (en) * 1997-05-02 2004-01-27 Siemens Information & Communication Networks, Inc. Integrated hearing aid for telecommunications devices
EP0932897B1 (en) 1997-06-26 2003-10-08 Koninklijke Philips Electronics N.V. A machine-organized method and a device for translating a word-organized source text into a word-organized target text
JPH1169495A (en) * 1997-07-18 1999-03-09 Koninkl Philips Electron Nv Hearing aid
JPH1169499A (en) * 1997-07-18 1999-03-09 Koninkl Philips Electron Nv Hearing aid, remote control device and system
JP3190859B2 (en) * 1997-07-29 2001-07-23 松下電器産業株式会社 CDMA radio transmitting apparatus and CDMA radio receiving apparatus
WO1999007302A1 (en) * 1997-08-07 1999-02-18 Natan Bauman Apparatus and method for an auditory stimulator
FI105874B (en) * 1997-08-12 2000-10-13 Nokia Mobile Phones Ltd Multiple mobile broadcasting
US6052657A (en) * 1997-09-09 2000-04-18 Dragon Systems, Inc. Text segmentation and identification of topic using language models
US6081629A (en) * 1997-09-17 2000-06-27 Browning; Denton R. Handheld scanner and accompanying remote access agent
US6076056A (en) * 1997-09-19 2000-06-13 Microsoft Corporation Speech recognition system for recognizing continuous and isolated speech
US6163769A (en) * 1997-10-02 2000-12-19 Microsoft Corporation Text-to-speech using clustered context-dependent phoneme-based units
US6674867B2 (en) * 1997-10-15 2004-01-06 Belltone Electronics Corporation Neurofuzzy based device for programmable hearing aids
US6219427B1 (en) * 1997-11-18 2001-04-17 Gn Resound As Feedback cancellation improvements
US6695943B2 (en) * 1997-12-18 2004-02-24 Softear Technologies, L.L.C. Method of manufacturing a soft hearing aid
US6366863B1 (en) * 1998-01-09 2002-04-02 Micro Ear Technology Inc. Portable hearing-related analysis system
US6023570A (en) * 1998-02-13 2000-02-08 Lattice Semiconductor Corp. Sequential and simultaneous manufacturing programming of multiple in-system programmable systems through a data network
US6545989B1 (en) * 1998-02-19 2003-04-08 Qualcomm Incorporated Transmit gating in a wireless communication system
US6104913A (en) * 1998-03-11 2000-08-15 Bell Atlantic Network Services, Inc. Personal area network for personal telephone services
US6418431B1 (en) * 1998-03-30 2002-07-09 Microsoft Corporation Information retrieval and speech recognition based on language models
US6141641A (en) * 1998-04-15 2000-10-31 Microsoft Corporation Dynamically configurable acoustic model for speech recognition system
US6347148B1 (en) * 1998-04-16 2002-02-12 Dspfactory Ltd. Method and apparatus for feedback reduction in acoustic systems, particularly in hearing aids
US6351472B1 (en) * 1998-04-30 2002-02-26 Siemens Audiologische Technik Gmbh Serial bidirectional data transmission method for hearing devices by means of signals of different pulsewidths
US6137889A (en) * 1998-05-27 2000-10-24 Insonus Medical, Inc. Direct tympanic membrane excitation via vibrationally conductive assembly
US6188979B1 (en) * 1998-05-28 2001-02-13 Motorola, Inc. Method and apparatus for estimating the fundamental frequency of a signal
US6151645A (en) * 1998-08-07 2000-11-21 Gateway 2000, Inc. Computer communicates with two incompatible wireless peripherals using fewer transceivers
US6240193B1 (en) * 1998-09-17 2001-05-29 Sonic Innovations, Inc. Two line variable word length serial interface
US6061431A (en) * 1998-10-09 2000-05-09 Cisco Technology, Inc. Method for hearing loss compensation in telephony systems based on telephone number resolution
US6188976B1 (en) * 1998-10-23 2001-02-13 International Business Machines Corporation Apparatus and method for building domain-specific language models
US6838485B1 (en) * 1998-10-23 2005-01-04 Baker Hughes Incorporated Treatments for drill cuttings
US6265102B1 (en) * 1998-11-05 2001-07-24 Electric Fuel Limited (E.F.L.) Prismatic metal-air cells
JP4302326B2 (en) * 1998-11-30 2009-07-22 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Automatic classification of text
DE19858398C1 (en) * 1998-12-17 2000-03-02 Implex Hear Tech Ag Tinnitus treatment implant comprises a gas-tight biocompatible electroacoustic transducer for implantation in a mastoid cavity
US6208273B1 (en) * 1999-01-29 2001-03-27 Interactive Silicon, Inc. System and method for performing scalable embedded parallel data compression
DE19914993C1 (en) * 1999-04-01 2000-07-20 Implex Hear Tech Ag Fully implantable hearing system with telemetric sensor testing has measurement and wireless telemetry units on implant side for transmitting processed signal to external display/evaluation unit
DE19915846C1 (en) * 1999-04-08 2000-08-31 Implex Hear Tech Ag Partially implantable system for rehabilitating hearing trouble includes a cordless telemetry device to transfer data between an implantable part, an external unit and an energy supply.
US6094492A (en) * 1999-05-10 2000-07-25 Boesen; Peter V. Bone conduction voice transmission apparatus and system
US20020032564A1 (en) * 2000-04-19 2002-03-14 Farzad Ehsani Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface
US6557029B2 (en) * 1999-06-28 2003-04-29 Micro Design Services, Llc System and method for distributing messages
US6490558B1 (en) * 1999-07-28 2002-12-03 Custom Speech Usa, Inc. System and method for improving the accuracy of a speech recognition program through repetitive training
US6590986B1 (en) * 1999-11-12 2003-07-08 Siemens Hearing Instruments, Inc. Patient-isolating programming interface for programming hearing aids
US6324907B1 (en) * 1999-11-29 2001-12-04 Microtronic A/S Flexible substrate transducer assembly
US6366880B1 (en) * 1999-11-30 2002-04-02 Motorola, Inc. Method and apparatus for suppressing acoustic background noise in a communication system by equalization of pre- and post-comb-filtered subband spectral energies
US6601093B1 (en) * 1999-12-01 2003-07-29 Ibm Corporation Address resolution in ad-hoc networking
JP2001169380A (en) * 1999-12-14 2001-06-22 Casio Comput Co Ltd Ear mount type music reproducing device, and music reproduction system
US6377925B1 (en) * 1999-12-16 2002-04-23 Interactive Solutions, Inc. Electronic translator for assisting communications
JP2001177596A (en) * 1999-12-20 2001-06-29 Toshiba Corp Communication equipment and communication method
JP2001177889A (en) * 1999-12-21 2001-06-29 Casio Comput Co Ltd Body mounted music reproducing device, and music reproduction system
US6584358B2 (en) * 2000-01-07 2003-06-24 Biowave Corporation Electro therapy method and apparatus
US6850775B1 (en) * 2000-02-18 2005-02-01 Phonak Ag Fitting-anlage
US20010033664A1 (en) * 2000-03-13 2001-10-25 Songbird Hearing, Inc. Hearing aid format selector
DE10018360C2 (en) * 2000-04-13 2002-10-10 Cochlear Ltd At least partially implantable system for the rehabilitation of a hearing impairment
DE10018361C2 (en) * 2000-04-13 2002-10-10 Cochlear Ltd At least partially implantable cochlear implant system for the rehabilitation of a hearing disorder
DE10018334C1 (en) * 2000-04-13 2002-02-28 Implex Hear Tech Ag At least partially implantable system for the rehabilitation of a hearing impairment
US20010049566A1 (en) * 2000-05-12 2001-12-06 Samsung Electronics Co., Ltd. Apparatus and method for controlling audio output in a mobile terminal
AU6814201A (en) * 2000-06-01 2001-12-11 Otologics Llc Method and apparatus for measuring the performance of an implantable middle ear hearing aid, and the response of patient wearing such a hearing aid
DE10031832C2 (en) * 2000-06-30 2003-04-30 Cochlear Ltd Hearing aid for the rehabilitation of a hearing disorder
DE10041726C1 (en) * 2000-08-25 2002-05-23 Implex Ag Hearing Technology I Implantable hearing system with means for measuring the coupling quality
US20020076073A1 (en) * 2000-12-19 2002-06-20 Taenzer Jon C. Automatically switched hearing aid communications earpiece
US6584356B2 (en) * 2001-01-05 2003-06-24 Medtronic, Inc. Downloadable software support in a pacemaker
US20020095892A1 (en) * 2001-01-09 2002-07-25 Johnson Charles O. Cantilevered structural support
US6582628B2 (en) * 2001-01-17 2003-06-24 Dupont Mitsui Fluorochemicals Conductive melt-processible fluoropolymer
US6590987B2 (en) * 2001-01-17 2003-07-08 Etymotic Research, Inc. Two-wired hearing aid system utilizing two-way communication for programming
US6823312B2 (en) * 2001-01-18 2004-11-23 International Business Machines Corporation Personalized system for providing improved understandability of received speech
US20020150219A1 (en) * 2001-04-12 2002-10-17 Jorgenson Joel A. Distributed audio system for the capture, conditioning and delivery of sound
US6913578B2 (en) * 2001-05-03 2005-07-05 Apherma Corporation Method for customizing audio systems for hearing impaired
US6944474B2 (en) * 2001-09-20 2005-09-13 Sound Id Sound enhancement for mobile phones and other products producing personalized audio for users
US20030128859A1 (en) * 2002-01-08 2003-07-10 International Business Machines Corporation System and method for audio enhancement of digital devices for hearing impaired
CN1243541C (en) * 2002-05-09 2006-03-01 中国医学科学院药物研究所 2-(alpha-hydroxypentyl) benzoate and its preparing process and usage

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100380370C (en) * 2003-04-30 2008-04-09 罗伯特·博世有限公司 Method for statistical language modeling in speech recognition
CN101266599B (en) * 2005-01-31 2010-07-21 日电(中国)有限公司 Input method and user terminal
CN1916889B (en) * 2005-08-19 2011-02-02 Hitachi, Ltd. Corpus preparation device and method
CN101097488B (en) * 2006-06-30 2011-05-04 2012244 Ontario Inc. Method for learning character fragments from received text and related handheld electronic devices
CN105786796A (en) * 2008-04-16 2016-07-20 谷歌公司 Segmenting words using scaled probabilities
CN105786796B (en) * 2008-04-16 2019-02-22 Google LLC Segmenting words using scaled probabilities
CN103201707A (en) * 2010-09-29 2013-07-10 触摸式有限公司 System and method for inputting text into electronic devices
CN103201707B (en) * 2010-09-29 2017-09-29 Touchtype Ltd. Text prediction engine, system and method for inputting text into electronic devices
CN103034628A (en) * 2011-10-27 2013-04-10 微软公司 Functionality for normalizing linguistic items
CN103034628B (en) * 2011-10-27 2015-12-02 Microsoft Technology Licensing, LLC Functionality for normalizing linguistic items
US10613746B2 (en) 2012-01-16 2020-04-07 Touchtype Ltd. System and method for inputting text
CN105159890A (en) * 2014-06-06 2015-12-16 谷歌公司 Generating representations of input sequences using neural networks
US10181098B2 (en) 2014-06-06 2019-01-15 Google Llc Generating representations of input sequences using neural networks
US11222252B2 (en) 2014-06-06 2022-01-11 Google Llc Generating representations of input sequences using neural networks
CN107111609A (en) * 2014-12-12 2017-08-29 Omni AI, Inc. Lexical analyzer for a neural language behavior recognition system
CN107111609B (en) * 2014-12-12 2021-02-26 Omni AI, Inc. Lexical analyzer for a neural language behavior recognition system
CN107427732A (en) * 2016-12-09 2017-12-01 Hong Kong Applied Science and Technology Research Institute Co., Ltd. System and method for organizing and processing feature-based data structures
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2001037128A2 (en) 2001-05-25
WO2001037128A3 (en) 2002-02-07
JP2003523559A (en) 2003-08-05
US6904402B1 (en) 2005-06-07
US20040210434A1 (en) 2004-10-21
AU4610401A (en) 2001-05-30
CN100430929C (en) 2008-11-05

Similar Documents

Publication Publication Date Title
CN100430929C (en) System and iterative method for lexicon, segmentation and language model joint optimization
US10650356B2 (en) Intelligent self-service delivery advisor
US11468233B2 (en) Intention identification method, intention identification apparatus, and computer-readable recording medium
CN114585999A (en) Multilingual code line completion system
JP4945086B2 (en) Statistical language model for logical forms
JP5484317B2 (en) Large-scale language model in machine translation
US7020587B1 (en) Method and apparatus for generating and managing a language model data structure
RU2336552C2 (en) Linguistically informed statistic models of structure of components for ordering in realisation of sentences for system of natural language generation
CN1161747C (en) Network interactive user interface using speech recognition and natural language processing
US20210035556A1 (en) Fine-tuning language models for supervised learning tasks via dataset preprocessing
US20070282594A1 (en) Machine translation in natural language application development
TW201717070A (en) Statistics-based machine translation method, apparatus and electronic device
CN1426561A (en) Computer-aided reading system and method with cross-languige reading wizard
KR101130457B1 (en) Extracting treelet translation pairs
WO2001037126A2 (en) A system and method for joint optimization of language model performance and size
CN1457041A (en) System for automatically suppying training data for natural language analyzing system
CN101065746A (en) System and method for automatic enrichment of documents
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
US20220108080A1 (en) Reinforcement Learning Techniques for Dialogue Management
CN1627300A (en) Learning and using generalized string patterns for information extraction
CN1750119A (en) Creating a speech recognition grammar for alphanumeric concepts
CN100351837C (en) Automatic resolution of segmentation ambiguities in grammar authoring
CN113779062A (en) SQL statement generation method and device, storage medium and electronic equipment
CN111328416B (en) Speech patterns for fuzzy matching in natural language processing
CN110890090A (en) Context-based auxiliary interaction control method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150506

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150506

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington, USA

Patentee before: Microsoft Corp.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081105

Termination date: 20181103