WO2001037128A2 - A system and iterative method for lexicon, segmentation and language model joint optimization - Google Patents
- Publication number
- WO2001037128A2 WO2001037128A2 PCT/US2000/041870 US0041870W WO0137128A2 WO 2001037128 A2 WO2001037128 A2 WO 2001037128A2 US 0041870 W US0041870 W US 0041870W WO 0137128 A2 WO0137128 A2 WO 0137128A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- lexicon
- language model
- corpus
- segmentation
- storage medium
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
Definitions
- This invention generally relates to language modeling and, more specifically, to a system and iterative method for lexicon, word segmentation and language model joint optimization.
- a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of the sequence.
- a common approach to building a prior art language model is to utilize a prefix tree-like data structure to build an N-gram language model from a known training set of a textual corpus.
- the use of a prefix tree data structure enables a higher-level application to quickly traverse the language model, providing the substantially real-time performance characteristics described above.
- the N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in a string (of size N) throughout a text. The counts are used to calculate the probability of the use of the item strings.
- a textual corpus is dissected into a plurality of items (characters, letters, numbers, etc.);
- the items, e.g., characters (C), are segmented (e.g., into words (W)) in accordance with a small, pre-defined lexicon and a simple, pre-defined segmentation algorithm, wherein each W is mapped in the tree to one or more C's;
- a language model is trained on the dissected corpus by counting the occurrence of strings of characters, from which the probability of a sequence of words (W_1, W_2, ...W_M) is predicted from the previous two words: P(W_1, W_2, ...W_M) ≈ Π_i P(W_i | W_(i-2), W_(i-1))
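- To make the counting step concrete, the following is a minimal sketch (ours, not the patent's) of estimating P(W_i | W_(i-2), W_(i-1)) from tri-gram counts over an already-segmented corpus:

```python
from collections import defaultdict

def train_trigram(sentences):
    """Count tri-grams over pre-segmented sentences and return an estimator
    for P(w | w1, w2), i.e. the probability of a word given the previous two."""
    tri_counts = defaultdict(int)
    bi_counts = defaultdict(int)
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            tri_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bi_counts[(padded[i - 2], padded[i - 1])] += 1

    def prob(w, w1, w2):
        # Maximum-likelihood estimate; a real system would smooth these counts.
        denom = bi_counts.get((w1, w2), 0)
        return tri_counts.get((w1, w2, w), 0) / denom if denom else 0.0

    return prob

p = train_trigram([["a", "language", "model", "measures", "likelihood"]])
print(p("measures", "language", "model"))  # P("measures" | "language", "model") = 1.0
```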
- the N-gram language model is limited in a number of respects.
- the counting process utilized in constructing the prefix tree is very time consuming.
- as a result, only small N-gram models (typically bi-gram or tri-gram) are practical;
- the memory required to store the prefix tree increases by 2^N.
- the memory required to store the N-gram language model, and the access time required to utilize a large N-gram language model is prohibitively large for N-grams larger than three (i.e., a tri-gram).
- a fixed lexicon limits the ability of the model to select the best words in general or specific to a task. If a word is not in the lexicon, it does not exist as far as the model is concerned. Thus, a small lexicon is not likely to cover the intended linguistic content.
- segmentation algorithms are often ad-hoc and not based on any statistical or semantic principles.
- a simplistic segmentation algorithm typically errs in favor of larger words over smaller words.
- the model is unable to accurately predict smaller words contained within larger lexiconically acceptable strings.
- This invention concerns a system and iterative method for lexicon, segmentation and language model joint optimization.
- Fig. 1 is a block diagram of a computer system incorporating the teachings of the present invention
- Fig. 2 is a block diagram of an example modeling agent to iteratively develop a lexicon, segmentation and language model, according to one implementation of the present invention
- Fig. 3 is a graphical representation of a DOMM tree according to one aspect of the present invention.
- Fig. 4 is a flow chart of an example method for building a DOMM tree
- Fig. 5 is a flow chart of an example method for lexicon, segmentation and language model joint optimization, according to the teachings of the present invention
- Fig. 6 is a flow chart detailing the method steps for generating an initial lexicon, and iteratively altering a dynamically generated lexicon, segmentation and language model until convergence, according to one implementation of the present invention.
- Fig. 7 is a storage medium with a plurality of executable instructions which, when executed, implement the innovative modeling agent of the present invention, according to an alternate embodiment of the present invention.
- This invention concerns a system and iterative method for lexicon, segmentation and language model joint optimization.
- According to one aspect of the invention, an innovative language model, the Dynamic Order Markov Model (DOMM), is introduced.
- DOMM Language Model Data Structure
- program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- program modules may be located in both local and remote memory storage devices. It is noted, however, that modification to the architecture and methods described herein may well be made without deviating from the spirit and scope of the present invention.
- Fig. 1 illustrates an example computer system 102 including an innovative language modeling agent 104, to jointly optimize a lexicon, segmentation and language model according to the teachings of the present invention.
- language modeling agent 104 may well be implemented as a function of an application, e.g., word processor, web browser, speech recognition system, etc.
- innovative modeling agent may well be implemented in hardware, e.g., a programmable logic array (PLA), a special purpose processor, an application specific integrated circuit (ASIC), microcontroller, and the like.
- computer 102 is intended to represent any of a class of general or special purpose computing platforms which, when endowed with the innovative language modeling agent (LMA) 104, implement the teachings of the present invention in accordance with the first example implementation introduced above.
- computer system 102 may alternatively support a hardware implementation of LMA 104 as well.
- the following description of computer system 102 is intended to be merely illustrative, as computer systems of greater or lesser capability may well be substituted without deviating from the spirit and scope of the present invention.
- computer 102 includes one or more processors or processing units 132, a system memory 134, and a bus 136 that couples various system components including the system memory 134 to processors 132.
- the bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- the system memory includes read only memory (ROM) 138 and random access memory (RAM) 140.
- ROM read only memory
- RAM random access memory
- a basic input/output system (BIOS) 142 containing the basic routines that help to transfer information between elements within computer 102, such as during start-up, is stored in ROM 138.
- Computer 102 further includes a hard disk drive 144 for reading from and writing to a hard disk, not shown, a magnetic disk drive 146 for reading from and writing to a removable magnetic disk 148, and an optical disk drive 150 for reading from or writing to a removable optical disk 152 such as a CD ROM, DVD ROM or other such optical media.
- the hard disk drive 144, magnetic disk drive 146, and optical disk drive 150 are connected to the bus 136 by a SCSI interface 154 or some other suitable bus interface.
- the drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for computer 102.
- a number of program modules may be stored on the hard disk 144, magnetic disk 148, optical disk 152, ROM 138, or RAM 140, including an operating system 158, one or more application programs 160 including, for example, the innovative LMA 104 incorporating the teachings of the present invention, other program modules 162, and program data 164 (e.g., resultant language model data structures, etc.).
- a user may enter commands and information into computer 102 through input devices such as keyboard 166 and pointing device 168.
- Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are connected to the processing unit 132 through an interface 170 that is coupled to bus 136.
- a monitor 172 or other type of display device is also connected to the bus 136 via an interface, such as a video adapter 174.
- personal computers often include other peripheral output devices (not shown) such as speakers and printers.
- computer 102 operates in a networked environment using logical connections to one or more remote computers, such as a remote computer 176.
- the remote computer 176 may be another personal computer, a personal digital assistant, a server, a router or other network device, a network "thin-client" PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer 102, although only a memory storage device 178 has been illustrated in Fig. 1.
- the logical connections depicted in Fig. 1 include a local area network (LAN) 180 and a wide area network (WAN) 182, such as the Internet.
- When used in a LAN networking environment, computer 102 is connected to the local network 180 through a network interface or adapter 184.
- When used in a WAN networking environment, computer 102 typically includes a modem 186 or other means for establishing communications over the wide area network 182, such as the Internet.
- the modem 186, which may be internal or external, is connected to the bus 136 via an input/output (I/O) interface 156.
- I/O interface 156 also supports one or more printers 188.
- program modules depicted relative to the personal computer 102, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
- the data processors of computer 102 are programmed by means of instructions stored at different times in the various computer-readable storage media of the computer. Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs.
- the invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing the innovative steps described below in conjunction with a microprocessor or other data processor.
- the invention also includes the computer itself when programmed according to the methods and techniques described below.
- certain sub-components of the computer may be programmed to perform the functions and steps described below.
- the invention includes such sub-components when they are programmed as described.
- the invention described herein includes data structures, described below, as embodied on various types of memory media.
- programs and other executable program components such as the operating system are illustrated herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer, and are executed by the data processor(s) of the computer.
- Fig. 2 illustrates a block diagram of an example language modeling agent 104, according to one implementation of the present invention.
- language modeling agent 104 is comprised of one or more controllers 202, innovative analysis engine 204, storage/memory device(s) 206 and, optionally, one or more additional applications (e.g., graphical user interface, prediction application, verification application, estimation application, etc.) 208, each communicatively coupled as shown.
- LMA 104 may well be implemented as a function of a higher level application, e.g., a word processor, web browser, speech recognition system, or a language conversion system.
- controller(s) 202 of LMA 104 are responsive to one or more instructional commands from a parent application to selectively invoke the features of LMA 104.
- LMA 104 may well be implemented as a stand-alone language modeling tool, providing a user with a user interface (208) to selectively implement the features of LMA 104 discussed below.
- controller(s) 202 of LMA 104 selectively invoke one or more of the functions of analysis engine 204 to optimize a language model from a dynamically generated lexicon and segmentation algorithm.
- controller 202 is intended to represent any of a number of alternate control systems known in the art including, but not limited to, a microprocessor, a programmable logic array (PLA), a micro-machine, an application specific integrated circuit (ASIC) and the like.
- controller 202 is intended to represent a series of executable instructions to implement the control logic described above.
- the innovative analysis engine 204 is comprised of a Markov probability calculator 212, a data structure generator 210 including a frequency calculation function 213, a dynamic lexicon generation function 214 and a dynamic segmentation function 216, and a data structure memory manager 218.
- controller 202 selectively invokes an instance of the analysis engine 204 to develop, modify and optimize a statistical language model (SLM).
- analysis engine 204 develops a statistical language model data structure fundamentally based on the Markov transition probabilities between individual items (e.g., characters, letters, numbers, etc.) of a textual corpus (e.g., one or more sets of text).
- analysis engine 204 utilizes as much data (referred to as "context" or "order") as is available to calculate the probability of an item string.
- the language model of the present invention is aptly referred to as a Dynamic Order Markov Model (DOMM).
- When invoked by controller 202 to establish a DOMM data structure, analysis engine 204 selectively invokes the data structure generator 210. In response, data structure generator 210 establishes a tree-like data structure comprised of a plurality of nodes (associated with each of the plurality of items) and denoting inter-node dependencies. As described above, the tree-like data structure is referred to herein as a DOMM data structure, or DOMM tree. Controller 202 receives the textual corpus and stores at least a subset of the textual corpus in memory 206 as a dynamic training set 222 from which the language model is to be developed. It will be appreciated that, in alternate embodiments, a predetermined training set may also be used.
- Frequency calculation function 213 identifies a frequency of occurrence for each item (character, letter, number, word, etc.) in the training set subset. Based on inter-node dependencies, data structure generator 210 assigns each item to an appropriate node of the DOMM tree, with an indication of the frequency value and a compare bit.
- the Markov probability calculator 212 calculates the probability of an item (character, letter, number, etc.) from a context (j) of associated items. More specifically, according to the teachings of the present invention, the Markov probability of a particular item (C_i) is dependent on as many previous characters as data "allows", in other words: P(C_i) = P(C_i | C_(i-1), C_(i-2), ..., C_(i-j)).
- the number of characters employed as context (j) by Markov probability calculator 212 is a "dynamic" quantity that is different for each sequence of characters C_i: C_(i-1), C_(i-2), C_(i-3), etc.
- the number of characters relied upon for context (j) by Markov probability calculator 212 is dependent, at least in part, on a frequency value for each of the characters, i.e., the rate at which they appear throughout the corpus. More specifically, if in identifying the items of the corpus Markov probability calculator 212 does not identify at least a minimum occurrence frequency for a particular item, it may be "pruned" (i.e., removed) from the tree as being statistically irrelevant. According to one embodiment, the minimum frequency threshold is three (3).
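- As an illustration of the dynamic-order idea (a sketch of our own, with an assumed maximum context length of 8 and the minimum-frequency threshold of 3 mentioned above), the context can be extended backwards only as far as the observed counts support it:

```python
from collections import defaultdict

MIN_COUNT = 3   # items/contexts seen fewer than 3 times are treated as pruned

def build_counts(text, max_order=8):
    """Count every substring of length 1 .. max_order+1 (context plus next item)."""
    counts = defaultdict(int)
    for i in range(len(text)):
        for length in range(1, max_order + 2):
            if i + length > len(text):
                break
            counts[text[i:i + length]] += 1
    return counts

def dynamic_order_prob(counts, text, i, max_order=8):
    """P(C_i | C_(i-1), ..., C_(i-j)) with j chosen as large as the data allows."""
    for j in range(min(i, max_order), 0, -1):
        context = text[i - j:i]
        joint, ctx = counts.get(context + text[i], 0), counts.get(context, 0)
        if joint >= MIN_COUNT and ctx >= MIN_COUNT:
            return joint / ctx, j                  # longest reliable context wins
    return counts.get(text[i], 0) / len(text), 0   # unigram fallback

corpus = "the cat sat on the mat. the cat sat."
counts = build_counts(corpus)
print(dynamic_order_prob(counts, corpus, 10))      # probability of corpus[10] and the order used
```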
- analysis engine 204 does not rely on a fixed lexicon or a simple segmentation algorithm (both of which tend to be error prone). Rather, analysis engine 204 selectively invokes a dynamic segmentation function 216 to segment items (characters or letters, for example) into strings (e.g., words). More precisely, segmentation function 216 segments the training set 222 into subsets (chunks) and calculates a cohesion score (i.e., a measure of the similarity between items within the subset). The segmentation and cohesion calculation is iteratively performed by segmentation function 216 until the cohesion score for each subset reaches a predetermined threshold.
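- The patent does not spell out the cohesion measure; one plausible reading (purely our assumption) is an average pointwise mutual information between adjacent items in a chunk, sketched below:

```python
import math
from collections import Counter

def cohesion(chunk, unigram, bigram, total):
    """Average pointwise mutual information of adjacent items in a chunk.
    Higher values suggest the items belong together as one 'word'."""
    if len(chunk) < 2:
        return 0.0
    score = 0.0
    for a, b in zip(chunk, chunk[1:]):
        p_ab = bigram[(a, b)] / total
        if p_ab == 0.0:
            return float("-inf")                    # never-seen pair: no cohesion
        score += math.log(p_ab / ((unigram[a] / total) * (unigram[b] / total)))
    return score / (len(chunk) - 1)

text = "thecatsatonthemat"
unigram, bigram = Counter(text), Counter(zip(text, text[1:]))
print(cohesion("the", unigram, bigram, len(text)))   # cohesive chunk, higher score
print(cohesion("tsa", unigram, bigram, len(text)))   # chunk straddling a word boundary, lower score
```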
- the lexicon generation function 214 is invoked to dynamically generate and maintain a lexicon 220 in memory 206.
- lexicon generation function 214 analyzes the segmentation results and generates a lexicon from item strings with a Markov transition probability that exceeds a threshold.
- lexicon generation function 214 develops a dynamic lexicon 220 from item strings which exceed a pre-determined Markov transition probability taken from one or more language models developed by analysis engine 204.
- analysis engine 204 dynamically generates a lexicon of statistically significant, statistically accurate item strings from one or more language models developed over a period of time.
- the lexicon 220 comprises a "virtual corpus" that Markov probability calculator 212 relies upon (in addition to the dynamic training set) in developing subsequent language models.
- When invoked to modify or utilize the DOMM language model data structure, analysis engine 204 selectively invokes an instance of data structure memory manager 218.
- data structure memory manager 218 utilizes system memory as well as extended memory to maintain the DOMM data structure. More specifically, as will be described in greater detail below with reference to Figs. 6 and 7, data structure memory manager 218 employs a WriteNode function and a ReadNode function (not shown) to maintain a subset of the most recently used nodes of the DOMM data structure in a first level cache 224 of a system memory 206, while relegating least recently used nodes to extended memory (e.g., disk files in hard drive 144, or some remote drive), to provide for improved performance characteristics.
- a second level cache of system memory 206 is used to aggregate write commands until a predetermined threshold has been met, at which point data structure memory manager 218 makes one aggregate WriteNode command to an appropriate location in memory.
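- A rough sketch of that two-level arrangement (our own illustration; the patent names ReadNode and WriteNode but does not define their signatures, and the eviction policy and batch size below are assumptions):

```python
import pickle
from collections import OrderedDict

class NodeStore:
    """First-level LRU cache for recently used nodes, a second-level buffer that
    aggregates writes, and a dictionary standing in for disk files ('extended memory')."""

    def __init__(self, cache_size=4096, write_batch=256):
        self.cache = OrderedDict()      # first-level cache: node_id -> node
        self.pending = {}               # second-level cache: batched writes
        self.disk = {}                  # stand-in for extended-memory disk files
        self.cache_size, self.write_batch = cache_size, write_batch

    def write_node(self, node_id, node):
        self.pending[node_id] = node
        if len(self.pending) >= self.write_batch:
            self._flush()               # one aggregate write once the threshold is met

    def read_node(self, node_id):
        if node_id in self.pending:
            return self.pending[node_id]
        if node_id in self.cache:
            self.cache.move_to_end(node_id)          # mark as most recently used
            return self.cache[node_id]
        node = pickle.loads(self.disk[node_id])      # fall back to extended memory
        self._put(node_id, node)
        return node

    def _put(self, node_id, node):
        self.cache[node_id] = node
        if len(self.cache) > self.cache_size:
            old_id, old_node = self.cache.popitem(last=False)   # evict least recently used
            self.disk[old_id] = pickle.dumps(old_node)

    def _flush(self):
        for node_id, node in self.pending.items():
            self._put(node_id, node)
        self.pending.clear()
```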
- data structure memory manager 218 may well be combined as a functional element of controller(s) 202 without deviating from the spirit and scope of the present invention.
- Fig. 3 is a conceptual illustration of an example Dynamic Order Markov Model tree-like data structure 300, according to the teachings of the present invention.
- Fig. 3 presents an example DOMM data structure 300 for a language model developed from the English alphabet, i.e., A, B, C, ...Z.
- the DOMM tree 300 is comprised of one or more root nodes 302 and one or more subordinate nodes 304, each associated with an item (character, letter, number, word, etc.) of a textual corpus, logically coupled to denote dependencies between nodes.
- root nodes 302 are comprised of an item and a frequency value (e.g., a count of how many times the item occurs in the corpus).
- the subordinate nodes are arranged in binary sub-trees, wherein each node includes a compare bit (b_i), an item with which the node is associated, and an associated frequency value.
- a binary sub-tree is comprised of subordinate nodes 308-318 denoting the relationships between nodes and the frequency with which they occur.
- the complexity of a search of the DOMM tree approximates log(N), where N is the total number of nodes to be searched.
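- The exact compare-bit test is not described, so the following is only one plausible reading (a bit test against the item's character code), sketched to show how a log(N)-style descent over such nodes could look:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DommNode:
    """A subordinate node: compare bit, the item it represents, and its frequency."""
    item: str
    freq: int
    compare_bit: int
    left: Optional["DommNode"] = None
    right: Optional["DommNode"] = None

def find(node: Optional[DommNode], item: str) -> Optional[DommNode]:
    """Descend the binary sub-tree by testing the node's compare bit against the
    item's code, so a lookup touches roughly log(N) of the N nodes."""
    while node is not None:
        if node.item == item:
            return node
        bit = (ord(item[0]) >> node.compare_bit) & 1
        node = node.right if bit else node.left
    return None

root = DommNode("B", freq=7, compare_bit=2,
                left=DommNode("A", freq=12, compare_bit=1),
                right=DommNode("C", freq=3, compare_bit=1))
print(find(root, "A").freq)   # 12
```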
- DOMM tree 300 may exceed the space available in the memory device 206 of LMA 104 and/or the main memory 140 of computer system 102. Accordingly, data structure memory manager 218 facilitates storage of a DOMM tree data structure 300 across main memory (e.g., 140 and/or 206) into an extended memory space, e.g., disk files on a mass storage device such as hard drive 144 of computer system 102.
- Fig. 4 is a flow chart of an example method for building a Dynamic Order Markov Model data structure, according to one aspect of the present invention.
- language modeling agent 104 may be invoked directly by a user or a higher-level application.
- controller 202 of LMA 104 selectively invokes an instance of analysis engine 204, and a textual corpus (e.g., one or more documents) is loaded into memory 206 as a dynamic training set 222 and split into subsets (e.g., sentences, lines, etc.), block 402.
- data structure generator 210 assigns each item of the subset to a node in data structure and calculates a frequency value for the item, block 404.
- frequency calculation function 213 is invoked to identify the occurrence frequency of each item within the training set subset.
- data structure generator determines whether additional subsets of the training set remain and, if so, the next subset is read in block 408 and the process continues with block 404.
- data structure generator 210 completely populates the data structure, a subset at a time, before invocation of the frequency calculation function 213.
- frequency calculation function 213 simply counts each item as it is placed into associated nodes of the data structure.
- data structure generator 210 may optionally prune the data structure, block 410.
- a number of mechanisms may be employed to prune the resultant data structure 300.
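- One such mechanism, following the minimum-frequency idea above, could simply drop rarely seen nodes; a small sketch (the nested-dict node layout is an assumption for illustration):

```python
def prune(node, min_freq=3):
    """Recursively drop children whose frequency falls below min_freq.
    Nodes are assumed to be dicts of the form {"freq": int, "children": {item: node}}."""
    node["children"] = {
        item: prune(child, min_freq)
        for item, child in node["children"].items()
        if child["freq"] >= min_freq
    }
    return node

tree = {"freq": 10, "children": {
    "a": {"freq": 7, "children": {}},
    "b": {"freq": 1, "children": {}},   # below the threshold: statistically irrelevant
}}
print(prune(tree)["children"].keys())   # dict_keys(['a'])
```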
- Fig. 5 is a flow chart of an example method for lexicon, segmentation and language model joint optimization, according to the teachings of the present invention. As shown, the method begins with block 400 wherein LMA 104 is invoked and a prefix tree of at least a subset of the received corpus is built. More specifically, as detailed in Fig. 4, data structure generator 210 of modeling agent 104 analyzes the received corpus and selects at least a subset as a training set, from which a DOMM tree is built.
- a very large lexicon is built from the prefix tree and pre-processed to remove some obvious illogical words. More specifically, lexicon generation function 214 is invoked to build an initial lexicon from the prefix tree. According to one implementation, the initial lexicon is built from the prefix tree using all sub-strings whose length is less than some pre-defined value, say ten (10) items (i.e., the sub-string is ten nodes or less from root to the most subordinate node). Once the initial lexicon is compiled, lexicon generation function 214 prunes the lexicon by removing some obvious illogical words (see, e.g., block 604, below). According to one implementation, lexicon generation function 214 appends a predefined lexicon with the new, initial lexicon generated from at least the training set of the received corpus.
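- A small sketch of that sub-string extraction under the ten-item bound (the tree layout here is the same assumed nested-dict form as above):

```python
def initial_lexicon(node, prefix="", max_len=10):
    """Collect every sub-string of at most max_len items reachable from the root
    as a candidate entry for the initial lexicon."""
    words = set()
    if prefix:
        words.add(prefix)
    if len(prefix) < max_len:
        for item, child in node["children"].items():
            words |= initial_lexicon(child, prefix + item, max_len)
    return words

leaf_t = {"freq": 4, "children": {}}
node_a = {"freq": 5, "children": {"t": leaf_t}}
node_c = {"freq": 5, "children": {"a": node_a}}
tree = {"freq": 9, "children": {"c": node_c}}
print(sorted(initial_lexicon(tree)))   # ['c', 'ca', 'cat']
```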
- At least the training set of the received corpus is segmented, using the initial lexicon. More particularly, dynamic segmentation function 216 is invoked to segment at least the training set of the received corpus to generate an initial segmented corpus.
- dynamic segmentation function 216 utilizes a Maximum Match technique to provide an initial segmented corpus.
- segmentation function 216 starts at the beginning of an item string (or branch of the DOMM tree) and searches the lexicon to see if the initial item (I_1) is a one-item "word". Segmentation function 216 then combines it with the next item in the string to see if the combination (e.g., I_1I_2) is found as a "word" in the lexicon, and so on. According to one implementation, the longest string (I_1, I_2, ...I_N) of items found in the lexicon is deemed to be the correct segmentation for that string. It is to be appreciated that more complex Maximum Match algorithms may well be utilized by segmentation function 216 without deviating from the scope and spirit of the present invention.
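- A compact forward maximum-match sketch (a simple greedy variant of our own; the text notes that more complex versions may be used):

```python
def max_match(items, lexicon, max_word_len=10):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, otherwise emit a single item."""
    words, i = [], 0
    while i < len(items):
        for length in range(min(max_word_len, len(items) - i), 0, -1):
            candidate = items[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

print(max_match("themat", {"the", "them", "mat", "at"}))  # ['them', 'at']
```

Note how the greedy match picks "them" + "at" where "the" + "mat" is more plausible, the longer-word bias noted earlier; the SLM-based re-segmentation described below corrects exactly this kind of error.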
- an iterative process is entered wherein the lexicon, segmentation and language model are jointly optimized, block 506. More specifically, as will be shown in greater detail below, the innovative iterative optimization employs a statistical language modeling approach to dynamically adjust the segmentation and lexicon to provide an optimized language model. That is, unlike prior art language modeling techniques, modeling agent 104 does not rely on a pre-defined static lexicon, or simplistic segmentation algorithm to generate a language model. Rather, modeling agent 104 utilizes the received corpus, or at least a subset thereof (training set), to dynamically generate a lexicon and segmentation to produce an optimized language model. In this regard, language models generated by modeling agent 104 do not suffer from the drawbacks and limitations commonly associated with prior art modeling systems.
- Fig. 6 presents a more detailed flow chart for generating an initial lexicon, and the iterative process of refining the lexicon and segmentation to optimize the language model, according to one implementation of the present invention.
- the method begins with step 400 (Fig. 4) of building a prefix tree from the received corpus.
- the prefix tree may be built using the entire corpus or, alternatively, using a subset of the entire corpus (referred to as a training corpus).
- the process of generating an initial lexicon begins with block 602, wherein lexicon generation function 214 generates an initial lexicon from the prefix tree by identifying substrings (or branches of the prefix tree) with less than a select number of items. According to one implementation, lexicon generation function 214 identifies substrings of ten (10) items or less to comprise the initial lexicon. In block 604, lexicon generation function 214 analyzes the initial lexicon generated in block 602 for obvious illogical substrings, removing these substrings from the initial lexicon. That is, lexicon generation function 214 analyzes the initial lexicon of substrings for illogical, or improbable, words and removes these words from the lexicon.
- dynamic segmentation function 216 is invoked to segment at least the training set of the received corpus to generate a segmented corpus.
- the Maximum Match algorithm is used to segment based on the initial lexicon.
- the frequency analysis function 213 is invoked to compute the frequency of the occurrence in the received corpus for each word in the lexicon, sorting the lexicon according to the frequency of occurrence. The word with the lowest frequency is identified and deleted from the lexicon.
- the threshold for this deletion and re-segmentation may be determined according to the size of the corpus.
- a corpus of 600M items may well utilize a frequency threshold of 500 for a word to be included within the lexicon. In this way, most of the obvious illogical words can be deleted from the initial lexicon.
- the received corpus is segmented based, at least in part, on the initial lexicon, block 504.
- the initial segmentation of the corpus is performed using a maximum matching process.
- the iterative process of dynamically altering the lexicon and segmentation begins to optimize a statistical language model (SLM) from the received corpus (or training set), block 506.
- the process begins in block 606, wherein the Markov probability calculator 212 utilizes the initial lexicon and segmentation to begin language model training using the segmented corpus. That is, given the initial lexicon and an initial segmentation, a statistical language model may be generated therefrom.
- although the initial language model does not yet benefit from a refined lexicon and a statistically based segmentation (which will evolve in the steps to follow), it is nonetheless fundamentally based on the received corpus itself.
- the segmented corpus (or training set) is re-segmented using SLM-based segmentation.
- Given a sentence w_1, w_2, ...w_n, there are M possible ways to segment it (where M ≥ 1).
- Dynamic segmentation function 216 computes a probability (p_i) of each segmentation (S_i) based on an N-gram statistical language model.
- a Viterbi search algorithm is employed to find the most probable segmentation S_k, where: S_k = argmax_i p_i.
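- A minimal dynamic-programming (Viterbi-style) sketch of that search; a unigram word model is used here for brevity, whereas the method described above scores segmentations with an N-gram language model:

```python
import math

def viterbi_segment(chars, word_logprob, max_word_len=10):
    """Find the segmentation of chars with the highest total log-probability
    under a unigram word model word_logprob: {word: log P(word)}."""
    n = len(chars)
    best = [float("-inf")] * (n + 1)     # best[i] = best log-prob of chars[:i]
    best[0] = 0.0
    back = [0] * (n + 1)                 # back[i] = start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = chars[j:i]
            if word in word_logprob and best[j] + word_logprob[word] > best[i]:
                best[i] = best[j] + word_logprob[word]
                back[i] = j
    words, i = [], n
    while i > 0:
        words.append(chars[back[i]:i])
        i = back[i]
    return list(reversed(words))

lm = {w: math.log(p) for w, p in
      {"the": 0.3, "them": 0.05, "mat": 0.2, "at": 0.1, "m": 0.01}.items()}
print(viterbi_segment("themat", lm))   # ['the', 'mat'] beats the greedy ['them', 'at']
```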
- the lexicon is updated using the re-segmented corpus resulting from the SLM-based segmentation described above.
- modeling agent 104 invokes frequency analysis function 213 to compute the frequency of occurrence in the received corpus for each word in the lexicon, sorting the lexicon according to the frequency of occurrence. The word with the lowest frequency is identified and deleted from the lexicon. All occurrences of the word must then be re-segmented into smaller words, and the unigram counts for those words are re-computed.
- the threshold for this deletion and re- segmentation may be determined according to the size of the corpus. According to one implementation, a corpus of 600M items may well utilize a frequency threshold of 500 to be included within the lexicon.
- the language model is updated to reflect the dynamically generated lexicon and the SLM-based segmentation, and a measure of the language model perplexity (i.e., an inverse probability measure) is computed by Markov probability calculator 212. If the perplexity continues to converge (toward zero (0)), i.e., improve, the process continues with block 608 wherein the lexicon and segmentation are once again modified with the intent of further improving the language model performance (as measured by perplexity). If in block 614 it is determined that the language model has not improved as a result of the recent modifications to the lexicon and segmentation, a further determination of whether the perplexity has reached an acceptable threshold is made, block 616. If so, the process ends.
- lexicon generation function 214 deletes the word with the smallest frequency of occurrence in the corpus from the lexicon, re-segmenting the word into smaller words, block 618, as the process continues with block 610.
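- For reference, the convergence test can be read as tracking per-word perplexity, the exponential of the average negative log-probability the model assigns to the training words; a minimal sketch (ours, with an assumed prob(word, history) callable):

```python
import math

def perplexity(sentences, prob):
    """Per-word perplexity: exp of the average negative log-probability.
    prob(word, history) is assumed to return P(word | history)."""
    log_sum, n_words = 0.0, 0
    for words in sentences:
        history = []
        for w in words:
            log_sum += -math.log(prob(w, tuple(history)))
            history.append(w)
            n_words += 1
    return math.exp(log_sum / n_words)

uniform = lambda w, history: 1.0 / 1000          # toy model: every word equally likely
print(perplexity([["a", "language", "model"]], uniform))   # 1000.0

# In the iteration of blocks 608-616, this value would be recomputed after each
# re-segmentation and lexicon update, and the loop stops once it no longer improves
# or reaches an acceptable threshold (thresholds are implementation choices).
```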
- innovative language modeling agent 104 generates an optimized language model premised on a dynamically generated lexicon and segmentation rules statistically predicated on at least a subset of the received corpus.
- the resultant language model has improved computational and predictive capability when compared to prior art language models.
- Fig. 7 is a block diagram of a storage medium having stored thereon a plurality of instructions including instructions to implement the innovative modeling agent of the present invention, according to yet another embodiment of the present invention.
- Fig. 7 illustrates a storage medium/device 700 having stored thereon a plurality of executable instructions 702, at least a subset of which, when executed, implement the innovative modeling agent 104 of the present invention.
- the executable instructions 702 When executed by a processor of a host system, the executable instructions 702 implement the modeling agent to generate a statistical language model representation of a textual corpus for use by any of a host of other applications executing on or otherwise available to the host system.
- storage medium 700 is intended to represent any of a number of storage devices and/or storage media known to those skilled in the art such as, for example, volatile memory devices, non-volatile memory devices, magnetic storage media, optical storage media, and the like.
- the executable instructions are intended to reflect any of a number of software languages known in the art such as, for example, C++, Visual Basic, Hypertext Markup Language (HTML), Java, extensible Markup Language (XML), and the like.
- the storage medium/device 700 need not be co-located with any host system. That is, storage medium/device 700 may well reside within a remote server communicatively coupled to and accessible by an executing system. Accordingly, the software implementation of Fig. 7 is to be regarded as illustrative, as alternate storage media and software embodiments are anticipated within the spirit and scope of the present invention.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2001539153A JP2003523559A (ja) | 1999-11-05 | 2000-11-03 | 辞典、セグメンテーションおよび言語モデルを同時最適化するためのシステムおよび反復的方法 |
| AU46104/01A AU4610401A (en) | 1999-11-05 | 2000-11-03 | A system and iterative method for lexicon, segmentation and language model joint optimization |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16385099P | 1999-11-05 | 1999-11-05 | |
| US60/163,850 | 1999-11-05 | ||
| US09/609,202 | 2000-06-30 | ||
| US09/609,202 US6904402B1 (en) | 1999-11-05 | 2000-06-30 | System and iterative method for lexicon, segmentation and language model joint optimization |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2001037128A2 true WO2001037128A2 (en) | 2001-05-25 |
| WO2001037128A3 WO2001037128A3 (en) | 2002-02-07 |
Family
ID=26860000
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2000/041870 Ceased WO2001037128A2 (en) | 1999-11-05 | 2000-11-03 | A system and iterative method for lexicon, segmentation and language model joint optimization |
Country Status (5)
| Country | Link |
|---|---|
| US (2) | US6904402B1 (en) |
| JP (1) | JP2003523559A (en) |
| CN (1) | CN100430929C (en) |
| AU (1) | AU4610401A (en) |
| WO (1) | WO2001037128A2 (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1623412A4 (en) * | 2003-04-30 | 2008-03-19 | Bosch Gmbh Robert | METHOD FOR STATISTICAL LANGUAGE MODELING FOR VOICE RECOGNITION |
| CN102799676A (zh) * | 2012-07-18 | 2012-11-28 | 上海语天信息技术有限公司 | 一种递归多层次中文分词方法 |
| US10181098B2 (en) | 2014-06-06 | 2019-01-15 | Google Llc | Generating representations of input sequences using neural networks |
| US11847413B2 (en) | 2014-12-12 | 2023-12-19 | Intellective Ai, Inc. | Lexical analyzer for a neuro-linguistic behavior recognition system |
| US12032909B2 (en) | 2014-12-12 | 2024-07-09 | Intellective Ai, Inc. | Perceptual associative memory for a neuro-linguistic behavior recognition system |
Families Citing this family (110)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7750891B2 (en) | 2003-04-09 | 2010-07-06 | Tegic Communications, Inc. | Selective input system based on tracking of motion parameters of an input device |
| AU5299700A (en) * | 1999-05-27 | 2000-12-18 | America Online, Inc. | Keyboard system with automatic correction |
| US7821503B2 (en) * | 2003-04-09 | 2010-10-26 | Tegic Communications, Inc. | Touch screen and graphical user interface |
| US7030863B2 (en) * | 2000-05-26 | 2006-04-18 | America Online, Incorporated | Virtual keyboard system with automatic correction |
| US7286115B2 (en) * | 2000-05-26 | 2007-10-23 | Tegic Communications, Inc. | Directional input system with automatic correction |
| US20050044148A1 (en) * | 2000-06-29 | 2005-02-24 | Microsoft Corporation | Method and system for accessing multiple types of electronic content |
| US7020587B1 (en) * | 2000-06-30 | 2006-03-28 | Microsoft Corporation | Method and apparatus for generating and managing a language model data structure |
| CN1226717C (zh) * | 2000-08-30 | 2005-11-09 | 国际商业机器公司 | 自动新词提取方法和系统 |
| DE60029456T2 (de) * | 2000-12-11 | 2007-07-12 | Sony Deutschland Gmbh | Verfahren zur Online-Anpassung von Aussprachewörterbüchern |
| US7177792B2 (en) * | 2001-05-31 | 2007-02-13 | University Of Southern California | Integer programming decoder for machine translation |
| WO2003005344A1 (en) * | 2001-07-03 | 2003-01-16 | Intel Zao | Method and apparatus for dynamic beam control in viterbi search |
| WO2003005166A2 (en) | 2001-07-03 | 2003-01-16 | University Of Southern California | A syntax-based statistical translation model |
| JP2003036088A (ja) * | 2001-07-23 | 2003-02-07 | Canon Inc | 音声変換の辞書管理装置 |
| US7620538B2 (en) * | 2002-03-26 | 2009-11-17 | University Of Southern California | Constructing a translation lexicon from comparable, non-parallel corpora |
| CA2411227C (en) * | 2002-07-03 | 2007-01-09 | 2012244 Ontario Inc. | System and method of creating and using compact linguistic data |
| EP1627325B1 (en) * | 2003-05-28 | 2011-07-27 | LOQUENDO SpA | Automatic segmentation of texts comprising chunks without separators |
| US7711545B2 (en) * | 2003-07-02 | 2010-05-04 | Language Weaver, Inc. | Empirical methods for splitting compound words with application to machine translation |
| US8548794B2 (en) | 2003-07-02 | 2013-10-01 | University Of Southern California | Statistical noun phrase translation |
| US7941310B2 (en) * | 2003-09-09 | 2011-05-10 | International Business Machines Corporation | System and method for determining affixes of words |
| US7698125B2 (en) * | 2004-03-15 | 2010-04-13 | Language Weaver, Inc. | Training tree transducers for probabilistic operations |
| US8296127B2 (en) * | 2004-03-23 | 2012-10-23 | University Of Southern California | Discovery of parallel text portions in comparable collections of corpora and training using comparable texts |
| US8666725B2 (en) | 2004-04-16 | 2014-03-04 | University Of Southern California | Selection and use of nonstatistical translation components in a statistical machine translation framework |
| JP5452868B2 (ja) | 2004-10-12 | 2014-03-26 | ユニヴァーシティー オブ サザン カリフォルニア | トレーニングおよび復号のためにストリングからツリーへの変換を使うテキスト‐テキスト・アプリケーションのためのトレーニング |
| PT1666074E (pt) | 2004-11-26 | 2008-08-22 | Ba Ro Gmbh & Co Kg | Lâmpada de desinfecção |
| CN100530171C (zh) * | 2005-01-31 | 2009-08-19 | 日电(中国)有限公司 | 字典学习方法和字典学习装置 |
| CN101266599B (zh) * | 2005-01-31 | 2010-07-21 | 日电(中国)有限公司 | 输入方法和用户终端装置 |
| CN101124579A (zh) * | 2005-02-24 | 2008-02-13 | 富士施乐株式会社 | 单词翻译装置、翻译方法以及翻译程序 |
| US7996219B2 (en) * | 2005-03-21 | 2011-08-09 | At&T Intellectual Property Ii, L.P. | Apparatus and method for model adaptation for spoken language understanding |
| US8676563B2 (en) | 2009-10-01 | 2014-03-18 | Language Weaver, Inc. | Providing human-generated and machine-generated trusted translations |
| US8886517B2 (en) | 2005-06-17 | 2014-11-11 | Language Weaver, Inc. | Trust scoring for language translation systems |
| US7974833B2 (en) | 2005-06-21 | 2011-07-05 | Language Weaver, Inc. | Weighted system of expressing language information using a compact notation |
| US7389222B1 (en) | 2005-08-02 | 2008-06-17 | Language Weaver, Inc. | Task parallelization in a text-to-text system |
| US7813918B2 (en) * | 2005-08-03 | 2010-10-12 | Language Weaver, Inc. | Identifying documents which form translated pairs, within a document collection |
| CN1916889B (zh) * | 2005-08-19 | 2011-02-02 | 株式会社日立制作所 | 语料库制作装置及其方法 |
| US7624020B2 (en) * | 2005-09-09 | 2009-11-24 | Language Weaver, Inc. | Adapter for allowing both online and offline training of a text to text system |
| US20070078644A1 (en) * | 2005-09-30 | 2007-04-05 | Microsoft Corporation | Detecting segmentation errors in an annotated corpus |
| US7328199B2 (en) * | 2005-10-07 | 2008-02-05 | Microsoft Corporation | Componentized slot-filling architecture |
| US10319252B2 (en) | 2005-11-09 | 2019-06-11 | Sdl Inc. | Language capability assessment and training apparatus and techniques |
| US20070106496A1 (en) * | 2005-11-09 | 2007-05-10 | Microsoft Corporation | Adaptive task framework |
| US7606700B2 (en) * | 2005-11-09 | 2009-10-20 | Microsoft Corporation | Adaptive task framework |
| US7941418B2 (en) * | 2005-11-09 | 2011-05-10 | Microsoft Corporation | Dynamic corpus generation |
| US7822699B2 (en) * | 2005-11-30 | 2010-10-26 | Microsoft Corporation | Adaptive semantic reasoning engine |
| US7831585B2 (en) * | 2005-12-05 | 2010-11-09 | Microsoft Corporation | Employment of task framework for advertising |
| US7933914B2 (en) | 2005-12-05 | 2011-04-26 | Microsoft Corporation | Automatic task creation and execution using browser helper objects |
| US20070130134A1 (en) * | 2005-12-05 | 2007-06-07 | Microsoft Corporation | Natural-language enabling arbitrary web forms |
| US7835911B2 (en) * | 2005-12-30 | 2010-11-16 | Nuance Communications, Inc. | Method and system for automatically building natural language understanding models |
| US20090006092A1 (en) * | 2006-01-23 | 2009-01-01 | Nec Corporation | Speech Recognition Language Model Making System, Method, and Program, and Speech Recognition System |
| US8296123B2 (en) * | 2006-02-17 | 2012-10-23 | Google Inc. | Encoding and adaptive, scalable accessing of distributed models |
| US20070203869A1 (en) * | 2006-02-28 | 2007-08-30 | Microsoft Corporation | Adaptive semantic platform architecture |
| US7996783B2 (en) * | 2006-03-02 | 2011-08-09 | Microsoft Corporation | Widget searching utilizing task framework |
| US8943080B2 (en) | 2006-04-07 | 2015-01-27 | University Of Southern California | Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections |
| US20070271087A1 (en) * | 2006-05-18 | 2007-11-22 | Microsoft Corporation | Language-independent language model using character classes |
| US7558725B2 (en) * | 2006-05-23 | 2009-07-07 | Lexisnexis, A Division Of Reed Elsevier Inc. | Method and apparatus for multilingual spelling corrections |
| EP2026327A4 (en) * | 2006-05-31 | 2012-03-07 | Nec Corp | LANGUAGE MODEL LEARNING, LANGUAGE MODEL LEARNING AND LANGUAGE MODEL LEARNING PROGRAM |
| CN101097488B (zh) * | 2006-06-30 | 2011-05-04 | 2012244安大略公司 | 从接收的文本中学习字符片段的方法及相关手持电子设备 |
| US8886518B1 (en) | 2006-08-07 | 2014-11-11 | Language Weaver, Inc. | System and method for capitalizing machine translated text |
| US8433556B2 (en) | 2006-11-02 | 2013-04-30 | University Of Southern California | Semi-supervised training for statistical word alignment |
| US9122674B1 (en) | 2006-12-15 | 2015-09-01 | Language Weaver, Inc. | Use of annotations in statistical machine translation |
| US8468149B1 (en) | 2007-01-26 | 2013-06-18 | Language Weaver, Inc. | Multi-lingual online community |
| US8201087B2 (en) | 2007-02-01 | 2012-06-12 | Tegic Communications, Inc. | Spell-check for a keyboard system with automatic correction |
| US8225203B2 (en) | 2007-02-01 | 2012-07-17 | Nuance Communications, Inc. | Spell-check for a keyboard system with automatic correction |
| US9465791B2 (en) * | 2007-02-09 | 2016-10-11 | International Business Machines Corporation | Method and apparatus for automatic detection of spelling errors in one or more documents |
| US8615389B1 (en) | 2007-03-16 | 2013-12-24 | Language Weaver, Inc. | Generation and exploitation of an approximate language model |
| US8831928B2 (en) | 2007-04-04 | 2014-09-09 | Language Weaver, Inc. | Customizable machine translation service |
| US8825466B1 (en) | 2007-06-08 | 2014-09-02 | Language Weaver, Inc. | Modification of annotated bilingual segment pairs in syntax-based machine translation |
| US7917355B2 (en) * | 2007-08-23 | 2011-03-29 | Google Inc. | Word detection |
| US8010341B2 (en) * | 2007-09-13 | 2011-08-30 | Microsoft Corporation | Adding prototype information into probabilistic models |
| US8521516B2 (en) * | 2008-03-26 | 2013-08-27 | Google Inc. | Linguistic key normalization |
| US8046222B2 (en) * | 2008-04-16 | 2011-10-25 | Google Inc. | Segmenting words using scaled probabilities |
| US8353008B2 (en) * | 2008-05-19 | 2013-01-08 | Yahoo! Inc. | Authentication detection |
| US9411800B2 (en) * | 2008-06-27 | 2016-08-09 | Microsoft Technology Licensing, Llc | Adaptive generation of out-of-dictionary personalized long words |
| US8301437B2 (en) * | 2008-07-24 | 2012-10-30 | Yahoo! Inc. | Tokenization platform |
| US8462123B1 (en) * | 2008-10-21 | 2013-06-11 | Google Inc. | Constrained keyboard organization |
| CN101430680B (zh) | 2008-12-31 | 2011-01-19 | 阿里巴巴集团控股有限公司 | 一种无词边界标记语言文本的分词序列选择方法及系统 |
| GB201016385D0 (en) * | 2010-09-29 | 2010-11-10 | Touchtype Ltd | System and method for inputting text into electronic devices |
| US8326599B2 (en) * | 2009-04-21 | 2012-12-04 | Xerox Corporation | Bi-phrase filtering for statistical machine translation |
| US8990064B2 (en) | 2009-07-28 | 2015-03-24 | Language Weaver, Inc. | Translating documents based on content |
| US8380486B2 (en) | 2009-10-01 | 2013-02-19 | Language Weaver, Inc. | Providing machine-generated translations and corresponding trust levels |
| US10417646B2 (en) | 2010-03-09 | 2019-09-17 | Sdl Inc. | Predicting the cost associated with translating textual content |
| GB201200643D0 (en) | 2012-01-16 | 2012-02-29 | Touchtype Ltd | System and method for inputting text |
| US11003838B2 (en) | 2011-04-18 | 2021-05-11 | Sdl Inc. | Systems and methods for monitoring post translation editing |
| WO2012145519A1 (en) * | 2011-04-20 | 2012-10-26 | Robert Bosch Gmbh | Speech recognition using multiple language models |
| US8694303B2 (en) | 2011-06-15 | 2014-04-08 | Language Weaver, Inc. | Systems and methods for tuning parameters in statistical machine translation |
| US8886515B2 (en) | 2011-10-19 | 2014-11-11 | Language Weaver, Inc. | Systems and methods for enhancing machine translation post edit review processes |
| CN103034628B (zh) * | 2011-10-27 | 2015-12-02 | 微软技术许可有限责任公司 | 用于将语言项目规范化的功能装置 |
| US8942973B2 (en) | 2012-03-09 | 2015-01-27 | Language Weaver, Inc. | Content page URL translation |
| US10261994B2 (en) | 2012-05-25 | 2019-04-16 | Sdl Inc. | Method and system for automatic management of reputation of translators |
| US9152622B2 (en) | 2012-11-26 | 2015-10-06 | Language Weaver, Inc. | Personalized machine translation via online adaptation |
| CN103871404B (zh) * | 2012-12-13 | 2017-04-12 | 北京百度网讯科技有限公司 | 一种语言模型的训练方法、查询方法和对应装置 |
| IL224482B (en) * | 2013-01-29 | 2018-08-30 | Verint Systems Ltd | System and method for keyword spotting using representative dictionary |
| US9396723B2 (en) * | 2013-02-01 | 2016-07-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for acoustic language model training |
| US9396724B2 (en) | 2013-05-29 | 2016-07-19 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for building a language model |
| CN104217717B (zh) * | 2013-05-29 | 2016-11-23 | 腾讯科技(深圳)有限公司 | 构建语言模型的方法及装置 |
| US9213694B2 (en) | 2013-10-10 | 2015-12-15 | Language Weaver, Inc. | Efficient online domain adaptation |
| US9972311B2 (en) * | 2014-05-07 | 2018-05-15 | Microsoft Technology Licensing, Llc | Language model optimization for in-domain application |
| US9953646B2 (en) | 2014-09-02 | 2018-04-24 | Belleau Technologies | Method and system for dynamic speech recognition and tracking of prewritten script |
| US9734826B2 (en) | 2015-03-11 | 2017-08-15 | Microsoft Technology Licensing, Llc | Token-level interpolation for class-based language models |
| KR101668725B1 (ko) * | 2015-03-18 | 2016-10-24 | 성균관대학교산학협력단 | 잠재 키워드 생성 방법 및 장치 |
| IL242218B (en) | 2015-10-22 | 2020-11-30 | Verint Systems Ltd | A system and method for maintaining a dynamic dictionary |
| IL242219B (en) | 2015-10-22 | 2020-11-30 | Verint Systems Ltd | System and method for keyword searching using both static and dynamic dictionaries |
| CN107427732B (zh) * | 2016-12-09 | 2021-01-29 | 香港应用科技研究院有限公司 | 用于组织和处理基于特征的数据结构的系统和方法 |
| CN109408794A (zh) * | 2017-08-17 | 2019-03-01 | 阿里巴巴集团控股有限公司 | 一种频次词典建立方法、分词方法、服务器和客户端设备 |
| US10607604B2 (en) * | 2017-10-27 | 2020-03-31 | International Business Machines Corporation | Method for re-aligning corpus and improving the consistency |
| CN110162681B (zh) * | 2018-10-08 | 2023-04-18 | 腾讯科技(深圳)有限公司 | 文本识别、文本处理方法、装置、计算机设备和存储介质 |
| CN110853628A (zh) * | 2019-11-18 | 2020-02-28 | 苏州思必驰信息科技有限公司 | 一种模型训练方法、装置、电子设备及存储介质 |
| CN111951788A (zh) * | 2020-08-10 | 2020-11-17 | 百度在线网络技术(北京)有限公司 | 一种语言模型的优化方法、装置、电子设备及存储介质 |
| US11893983B2 (en) * | 2021-06-23 | 2024-02-06 | International Business Machines Corporation | Adding words to a prefix tree for improving speech recognition |
| CN113468308B (zh) * | 2021-06-30 | 2023-02-10 | 竹间智能科技(上海)有限公司 | 一种对话行为分类方法及装置、电子设备 |
| CN115761886B (zh) * | 2022-11-18 | 2025-07-22 | 西安电子科技大学 | 基于自然语言知识描述引导的可解释性行为识别方法 |
| CN117351963A (zh) * | 2023-11-21 | 2024-01-05 | 京东城市(北京)数字科技有限公司 | 用于语音识别的方法、装置、设备和可读介质 |
Family Cites Families (83)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4689768A (en) * | 1982-06-30 | 1987-08-25 | International Business Machines Corporation | Spelling verification system with immediate operator alerts to non-matches between inputted words and words stored in plural dictionary memories |
| US4899148A (en) * | 1987-02-25 | 1990-02-06 | Oki Electric Industry Co., Ltd. | Data compression method |
| US6231938B1 (en) * | 1993-07-02 | 2001-05-15 | Watkins Manufacturing Corporation | Extruded multilayer polymeric shell having textured and marbled surface |
| US5621859A (en) * | 1994-01-19 | 1997-04-15 | Bbn Corporation | Single tree method for grammar directed, very large vocabulary speech recognizer |
| US5926388A (en) * | 1994-12-09 | 1999-07-20 | Kimbrough; Thomas C. | System and method for producing a three dimensional relief |
| US5806021A (en) * | 1995-10-30 | 1998-09-08 | International Business Machines Corporation | Automatic segmentation of continuous text using statistical approaches |
| JP3277792B2 (ja) * | 1996-01-31 | 2002-04-22 | 株式会社日立製作所 | データ圧縮方法および装置 |
| FR2744817B1 (fr) * | 1996-02-08 | 1998-04-03 | Ela Medical Sa | Dispositif medical implantable actif et son programmateur externe a mise a jour automatique du logiciel |
| US5822729A (en) * | 1996-06-05 | 1998-10-13 | Massachusetts Institute Of Technology | Feature-based speech recognizer having probabilistic linguistic processor providing word matching based on the entire space of feature vectors |
| US5963893A (en) * | 1996-06-28 | 1999-10-05 | Microsoft Corporation | Identification of words in Japanese text by a computer system |
| SE516189C2 (sv) * | 1996-07-03 | 2001-11-26 | Ericsson Telefon Ab L M | Förfarande och anordning för aktivering av en användarmeny i ett presentationsorgan |
| US5905972A (en) * | 1996-09-30 | 1999-05-18 | Microsoft Corporation | Prosodic databases holding fundamental frequency templates for use in speech synthesis |
| US6449662B1 (en) * | 1997-01-13 | 2002-09-10 | Micro Ear Technology, Inc. | System for programming hearing aids |
| US6424722B1 (en) * | 1997-01-13 | 2002-07-23 | Micro Ear Technology, Inc. | Portable system for programming hearing aids |
| DE19708183A1 (de) * | 1997-02-28 | 1998-09-03 | Philips Patentverwaltung | Verfahren zur Spracherkennung mit Sprachmodellanpassung |
| US6684063B2 (en) * | 1997-05-02 | 2004-01-27 | Siemens Information & Communication Networks, Inc. | Intergrated hearing aid for telecommunications devices |
| JP2000516749A (ja) | 1997-06-26 | 2000-12-12 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | 語構成源テキストを語構成目標テキストに翻訳する機械構成の方法及び装置 |
| JPH1169499A (ja) * | 1997-07-18 | 1999-03-09 | Koninkl Philips Electron Nv | 補聴器、リモート制御装置及びシステム |
| JPH1169495A (ja) * | 1997-07-18 | 1999-03-09 | Koninkl Philips Electron Nv | 補聴器 |
| JP3190859B2 (ja) * | 1997-07-29 | 2001-07-23 | 松下電器産業株式会社 | Cdma無線送信装置及びcdma無線受信装置 |
| WO1999007302A1 (en) * | 1997-08-07 | 1999-02-18 | Natan Bauman | Apparatus and method for an auditory stimulator |
| FI105874B (fi) * | 1997-08-12 | 2000-10-13 | Nokia Mobile Phones Ltd | Monipistematkaviestinlähetys |
| US6052657A (en) * | 1997-09-09 | 2000-04-18 | Dragon Systems, Inc. | Text segmentation and identification of topic using language models |
| US6081629A (en) * | 1997-09-17 | 2000-06-27 | Browning; Denton R. | Handheld scanner and accompanying remote access agent |
| US6076056A (en) * | 1997-09-19 | 2000-06-13 | Microsoft Corporation | Speech recognition system for recognizing continuous and isolated speech |
| US6163769A (en) * | 1997-10-02 | 2000-12-19 | Microsoft Corporation | Text-to-speech using clustered context-dependent phoneme-based units |
| US6674867B2 (en) * | 1997-10-15 | 2004-01-06 | Belltone Electronics Corporation | Neurofuzzy based device for programmable hearing aids |
| US6219427B1 (en) * | 1997-11-18 | 2001-04-17 | Gn Resound As | Feedback cancellation improvements |
| US6695943B2 (en) * | 1997-12-18 | 2004-02-24 | Softear Technologies, L.L.C. | Method of manufacturing a soft hearing aid |
| US6366863B1 (en) * | 1998-01-09 | 2002-04-02 | Micro Ear Technology Inc. | Portable hearing-related analysis system |
| US6023570A (en) * | 1998-02-13 | 2000-02-08 | Lattice Semiconductor Corp. | Sequential and simultaneous manufacturing programming of multiple in-system programmable systems through a data network |
| US6545989B1 (en) * | 1998-02-19 | 2003-04-08 | Qualcomm Incorporated | Transmit gating in a wireless communication system |
| US6104913A (en) * | 1998-03-11 | 2000-08-15 | Bell Atlantic Network Services, Inc. | Personal area network for personal telephone services |
| US6418431B1 (en) * | 1998-03-30 | 2002-07-09 | Microsoft Corporation | Information retrieval and speech recognition based on language models |
| US6141641A (en) * | 1998-04-15 | 2000-10-31 | Microsoft Corporation | Dynamically configurable acoustic model for speech recognition system |
| US6347148B1 (en) * | 1998-04-16 | 2002-02-12 | Dspfactory Ltd. | Method and apparatus for feedback reduction in acoustic systems, particularly in hearing aids |
| US6351472B1 (en) * | 1998-04-30 | 2002-02-26 | Siemens Audiologische Technik Gmbh | Serial bidirectional data transmission method for hearing devices by means of signals of different pulsewidths |
| US6137889A (en) * | 1998-05-27 | 2000-10-24 | Insonus Medical, Inc. | Direct tympanic membrane excitation via vibrationally conductive assembly |
| US6188979B1 (en) * | 1998-05-28 | 2001-02-13 | Motorola, Inc. | Method and apparatus for estimating the fundamental frequency of a signal |
| US6151645A (en) * | 1998-08-07 | 2000-11-21 | Gateway 2000, Inc. | Computer communicates with two incompatible wireless peripherals using fewer transceivers |
| US6240193B1 (en) * | 1998-09-17 | 2001-05-29 | Sonic Innovations, Inc. | Two line variable word length serial interface |
| US6061431A (en) * | 1998-10-09 | 2000-05-09 | Cisco Technology, Inc. | Method for hearing loss compensation in telephony systems based on telephone number resolution |
| US6838485B1 (en) * | 1998-10-23 | 2005-01-04 | Baker Hughes Incorporated | Treatments for drill cuttings |
| US6188976B1 (en) * | 1998-10-23 | 2001-02-13 | International Business Machines Corporation | Apparatus and method for building domain-specific language models |
| US6265102B1 (en) * | 1998-11-05 | 2001-07-24 | Electric Fuel Limited (E.F.L.) | Prismatic metal-air cells |
| KR100749289B1 (ko) * | 1998-11-30 | 2007-08-14 | Koninklijke Philips Electronics N.V. | Method and system for automatic segmentation of text |
| DE19858398C1 (de) * | 1998-12-17 | 2000-03-02 | Implex Hear Tech Ag | Implantable device for treating tinnitus |
| US6208273B1 (en) * | 1999-01-29 | 2001-03-27 | Interactive Silicon, Inc. | System and method for performing scalable embedded parallel data compression |
| DE19914993C1 (de) * | 1999-04-01 | 2000-07-20 | Implex Hear Tech Ag | Fully implantable hearing system with telemetric sensor testing |
| DE19915846C1 (de) * | 1999-04-08 | 2000-08-31 | Implex Hear Tech Ag | At least partially implantable system for the rehabilitation of a hearing impairment |
| US6094492A (en) * | 1999-05-10 | 2000-07-25 | Boesen; Peter V. | Bone conduction voice transmission apparatus and system |
| US20020032564A1 (en) * | 2000-04-19 | 2002-03-14 | Farzad Ehsani | Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface |
| US6557029B2 (en) * | 1999-06-28 | 2003-04-29 | Micro Design Services, Llc | System and method for distributing messages |
| US6490558B1 (en) * | 1999-07-28 | 2002-12-03 | Custom Speech Usa, Inc. | System and method for improving the accuracy of a speech recognition program through repetitive training |
| US6590986B1 (en) * | 1999-11-12 | 2003-07-08 | Siemens Hearing Instruments, Inc. | Patient-isolating programming interface for programming hearing aids |
| US6324907B1 (en) * | 1999-11-29 | 2001-12-04 | Microtronic A/S | Flexible substrate transducer assembly |
| US6366880B1 (en) * | 1999-11-30 | 2002-04-02 | Motorola, Inc. | Method and apparatus for suppressing acoustic background noise in a communication system by equalization of pre- and post-comb-filtered subband spectral energies |
| US6601093B1 (en) * | 1999-12-01 | 2003-07-29 | Ibm Corporation | Address resolution in ad-hoc networking |
| JP2001169380A (ja) * | 1999-12-14 | 2001-06-22 | Casio Comput Co Ltd | Ear-mounted music player and music playback system |
| US6377925B1 (en) * | 1999-12-16 | 2002-04-23 | Interactive Solutions, Inc. | Electronic translator for assisting communications |
| JP2001177596A (ja) * | 1999-12-20 | 2001-06-29 | Toshiba Corp | Communication apparatus and communication method |
| JP2001177889A (ja) * | 1999-12-21 | 2001-06-29 | Casio Comput Co Ltd | Body-worn music player and music playback system |
| ES2248274T3 (es) * | 2000-01-07 | 2006-03-16 | Biowave Corporation | Electrotherapy apparatus |
| US6850775B1 (en) * | 2000-02-18 | 2005-02-01 | Phonak Ag | Fitting-anlage |
| US20010033664A1 (en) * | 2000-03-13 | 2001-10-25 | Songbird Hearing, Inc. | Hearing aid format selector |
| DE10018334C1 (de) * | 2000-04-13 | 2002-02-28 | Implex Hear Tech Ag | At least partially implantable system for the rehabilitation of a hearing impairment |
| DE10018360C2 (de) * | 2000-04-13 | 2002-10-10 | Cochlear Ltd | At least partially implantable system for the rehabilitation of a hearing impairment |
| DE10018361C2 (de) * | 2000-04-13 | 2002-10-10 | Cochlear Ltd | At least partially implantable cochlear implant system for the rehabilitation of a hearing impairment |
| US20010049566A1 (en) * | 2000-05-12 | 2001-12-06 | Samsung Electronics Co., Ltd. | Apparatus and method for controlling audio output in a mobile terminal |
| WO2001093627A2 (en) * | 2000-06-01 | 2001-12-06 | Otologics, Llc | Method and apparatus measuring hearing aid performance |
| DE10031832C2 (de) * | 2000-06-30 | 2003-04-30 | Cochlear Ltd | Hearing aid for the rehabilitation of a hearing impairment |
| DE10041726C1 (de) * | 2000-08-25 | 2002-05-23 | Implex Ag Hearing Technology I | Implantable hearing system with means for measuring the coupling quality |
| US20020076073A1 (en) * | 2000-12-19 | 2002-06-20 | Taenzer Jon C. | Automatically switched hearing aid communications earpiece |
| US6584356B2 (en) * | 2001-01-05 | 2003-06-24 | Medtronic, Inc. | Downloadable software support in a pacemaker |
| US20020095892A1 (en) * | 2001-01-09 | 2002-07-25 | Johnson Charles O. | Cantilevered structural support |
| US6582628B2 (en) * | 2001-01-17 | 2003-06-24 | Dupont Mitsui Fluorochemicals | Conductive melt-processible fluoropolymer |
| US6590987B2 (en) * | 2001-01-17 | 2003-07-08 | Etymotic Research, Inc. | Two-wired hearing aid system utilizing two-way communication for programming |
| US6823312B2 (en) * | 2001-01-18 | 2004-11-23 | International Business Machines Corporation | Personalized system for providing improved understandability of received speech |
| US20020150219A1 (en) * | 2001-04-12 | 2002-10-17 | Jorgenson Joel A. | Distributed audio system for the capture, conditioning and delivery of sound |
| US6913578B2 (en) * | 2001-05-03 | 2005-07-05 | Apherma Corporation | Method for customizing audio systems for hearing impaired |
| US6944474B2 (en) * | 2001-09-20 | 2005-09-13 | Sound Id | Sound enhancement for mobile phones and other products producing personalized audio for users |
| US20030128859A1 (en) * | 2002-01-08 | 2003-07-10 | International Business Machines Corporation | System and method for audio enhancement of digital devices for hearing impaired |
| CN1243541C (zh) * | 2002-05-09 | 2006-03-01 | Institute of Materia Medica, Chinese Academy of Medical Sciences | 2-(α-hydroxypentyl)benzoate salts, and preparation and use thereof |
- 2000
  - 2000-06-30 US US09/609,202 patent/US6904402B1/en not_active Expired - Lifetime
  - 2000-11-03 JP JP2001539153A patent/JP2003523559A/ja active Pending
  - 2000-11-03 WO PCT/US2000/041870 patent/WO2001037128A2/en not_active Ceased
  - 2000-11-03 AU AU46104/01A patent/AU4610401A/en not_active Abandoned
  - 2000-11-03 CN CNB008152942A patent/CN100430929C/zh not_active Expired - Fee Related
- 2004
  - 2004-05-10 US US10/842,264 patent/US20040210434A1/en not_active Abandoned
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1623412A4 (en) * | 2003-04-30 | 2008-03-19 | Bosch Gmbh Robert | METHOD FOR STATISTICAL LANGUAGE MODELING FOR VOICE RECOGNITION |
| CN102799676A (zh) * | 2012-07-18 | 2012-11-28 | Shanghai Yutian Information Technology Co., Ltd. | Recursive multi-level Chinese word segmentation method |
| US10181098B2 (en) | 2014-06-06 | 2019-01-15 | Google Llc | Generating representations of input sequences using neural networks |
| US11222252B2 (en) | 2014-06-06 | 2022-01-11 | Google Llc | Generating representations of input sequences using neural networks |
| US11847413B2 (en) | 2014-12-12 | 2023-12-19 | Intellective Ai, Inc. | Lexical analyzer for a neuro-linguistic behavior recognition system |
| US12032909B2 (en) | 2014-12-12 | 2024-07-09 | Intellective Ai, Inc. | Perceptual associative memory for a neuro-linguistic behavior recognition system |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2001037128A3 (en) | 2002-02-07 |
| CN1387651A (zh) | 2002-12-25 |
| AU4610401A (en) | 2001-05-30 |
| US6904402B1 (en) | 2005-06-07 |
| CN100430929C (zh) | 2008-11-05 |
| JP2003523559A (ja) | 2003-08-05 |
| US20040210434A1 (en) | 2004-10-21 |
Similar Documents
| Publication | Title |
|---|---|
| US6904402B1 (en) | System and iterative method for lexicon, segmentation and language model joint optimization |
| US7020587B1 (en) | Method and apparatus for generating and managing a language model data structure |
| US7275029B1 (en) | System and method for joint optimization of language model performance and size |
| JP4945086B2 (ja) | Statistical language model for logical forms |
| US7493251B2 (en) | Using source-channel models for word segmentation |
| US6816830B1 (en) | Finite state data structures with paths representing paired strings of tags and tag combinations |
| US7158930B2 (en) | Method and apparatus for expanding dictionaries during parsing |
| CN110457708B (zh) | Artificial-intelligence-based vocabulary mining method and apparatus, server, and storage medium |
| US20030046078A1 (en) | Supervised automatic text generation based on word classes for language modeling |
| US9720903B2 (en) | Method for parsing natural language text with simple links |
| Babii et al. | Modeling vocabulary for big code machine learning |
| CN114154487A (zh) | Automatic text error correction method and apparatus, electronic device, and storage medium |
| CN112232057B (zh) | Text-expansion-based adversarial sample generation method, apparatus, medium, and device |
| US20060277028A1 (en) | Training a statistical parser on noisy data by filtering |
| JP2006065387A (ja) | Text sentence retrieval device, text sentence retrieval method, and text sentence retrieval program |
| CN114328822A (zh) | Intelligent contract text analysis method based on deep data mining |
| EP3598321A1 (en) | Method for parsing natural language text with constituent construction links |
| US10810368B2 (en) | Method for parsing natural language text with constituent construction links |
| CN111328416B (zh) | Speech patterns for fuzzy matching in natural language processing |
| CN109189907A (zh) | Retrieval method and apparatus based on semantic matching |
| JP5291645B2 (ja) | Data extraction device, data extraction method, and program |
| Mammadov et al. | Part-of-speech tagging for Azerbaijani language |
| JP5500636B2 (ja) | Phrase table generator and computer program therefor |
| Pla et al. | Improving chunking by means of lexical-contextual information in statistical language models |
| CN114817458A (zh) | Winning-bid project retrieval method based on a funnel model and cosine algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
| | AK | Designated states | Kind code of ref document: A3; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A3; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
| | WWE | Wipo information: entry into national phase | Ref document number: 008152942; Country of ref document: CN |
| | ENP | Entry into the national phase | Ref country code: JP; Ref document number: 2001 539153; Kind code of ref document: A; Format of ref document f/p: F |
| | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
| | 122 | Ep: pct application non-entry in european phase | |