WO2003003241A1 - Predictive cascading algorithm for multi-parser architecture - Google Patents

Predictive cascading algorithm for multi-parser architecture Download PDF

Info

Publication number
WO2003003241A1
WO2003003241A1 PCT/CN2001/001065 CN0101065W WO03003241A1 WO 2003003241 A1 WO2003003241 A1 WO 2003003241A1 CN 0101065 W CN0101065 W CN 0101065W WO 03003241 A1 WO03003241 A1 WO 03003241A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
grammar
analyzed
word
applicable
Prior art date
Application number
PCT/CN2001/001065
Other languages
French (fr)
Inventor
Kui Xu
Fuliang Weng
Original Assignee
Intel Corporation
Intel China Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation, Intel China Ltd. filed Critical Intel Corporation
Priority to PCT/CN2001/001065 priority Critical patent/WO2003003241A1/en
Priority to CNA018235743A priority patent/CN1545665A/en
Publication of WO2003003241A1 publication Critical patent/WO2003003241A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Definitions

  • the present invention generally relates to language processing, and in particular to multi-parser architecture for language processing.
  • a language parser may be used to derive a syntactic tree that shows syntactic relationship between the words and phrases in input text data, such as a sentence.
  • the language parser is an important component in various natural language applications. Due to the complexity of natural language itself, the grammar rules that describe the natural language is also complex. The traditional way of using a single language parser that encompasses the entire grammar of a natural language may lead to an inefficient, inaccurate, and often very large language processing system.
  • Figure 1 is a block diagram of a multi-parser system for parsing natural language sentences according to one embodiment of the invention.
  • Figures 2A-2C are flowcharts of parsing an input sentence according to one embodiment of the invention.
  • Figure 3 shows sub-grammars according to one embodiment of the invention.
  • Figure 4 is a table diagram showing the contents of a left corner condition table constructed according to one embodiment of the invention.
  • Figure 5 is a table diagram showing the contents of a sample symbol table.
  • Figure 6 is a table diagram showing a portion of a sample hash table.
  • Figure 7 is table diagram showing a sample left corner condition table constructed according to another embodiment of the invention.
  • Figure 8 is a lattice diagram illustrating parsing of an example sentence according to one embodiment of the invention.
  • Figure 9 is a lattice table according to one embodiment of the invention.
  • Figure 1 depicts a multi-parser system 100 for parsing natural language sentences according to one embodiment of the invention.
  • the multi- parser system 100 includes a predictive cascading unit 102, a probability score assigning unit 110, a Viterbi unit 112, a left comer condition table 108 and a hash table 120. Also included in the multi-parser system 100 are a number of sub-grammar parsers 114-118.
  • a grammar of a natural language e.g., English, Chinese, etc.
  • the size of a single parser to handle the entire grammar may be too large for programming languages.
  • a grammar is partitioned into a number of sub-grammar categories.
  • different sub-grammar parsers 114-118 are configured to handle different sub-grammar categories of a grammar, such as sentence (S), noun phrase (NP), verb phrase (VP) and prepositional phrase (PP), etc.
  • Each sub-grammar parser contains a set of grammar rules associated with the corresponding category of the grammar.
  • the predictive cascading unit 102 is configured to invoke only those sub-grammar parsers at each location in the input sentence when it meets certain criteria, called left corner condition. If it meets one of the left corner conditions associated with a sub-grammar parser, it invokes that particular sub-grammar parser at that particular place. Otherwise, that sub-grammar parser will not be invoked.
  • the sub-grammar parser checks the validity of the sequence of the current word and subsequent word or phrase being analyzed. If the sub-grammar parser determines that such sequence is valid, it will return one or more result edges to indicate that such sequence has satisfied one or more rules specified therein.
  • the parsing operations can be broken down into three stages, namely a pre-processing stage, a main processing stage and a post-processing stage.
  • the predictive cascading unit 102 receives an input sentence from an external source in block 200.
  • the predictive cascading unit 102 constructs a lattice table containing a list of edges initially derived by assigning an input edge for each word of the input sentence.
  • the input edges are sorted in reverse topological order and placed into a stack in block 210.
  • the main processing stage consists of a loop (blocks 215-245) to identify one or more sub-grammar parsers applicable to each edge contained in the stack and to parse the respective edge using the sub-grammar parsers identified as being applicable.
  • the loop processes each entry loaded in the stack individually starting from the highest entry. If the stack is not empty (block 215, no), the predictive cascading unit proceeds to block 220 where an edge from the top of the stack is popped out. The most recently popped out of the stack will be referred as a current edge.
  • a grammar of a natural language typically consists of a large number of rules.
  • the present invention provides a way to reduce the amount of time needed to identify grammar rules that may be applicable to the current edge.
  • a left corner condition table is used to identify those sub-grammar categories that may be applicable to the current word without having to search through the entire set of grammar rules.
  • the predictive cascading unit identifies one or more sub-grammar categories applicable to the current edge based on the left corner condition table in block 225. Then in block 230, the multi-parser system parses the current edge by invoking the sub-gram ar parsers associated with the sub- grammar categories identified as being applicable. If the sub-grammar parser produces a result edge (block 235, yes), then multi-parser system takes the result edge and updates the stack by pushing the result edge into the stack in block 240.
  • the multi-parser system In addition to updating the stack, the multi-parser system also takes the result edge derived from the respective sub-grammar parser and adds it to the lattice table in block 245. This loop (blocks 215-245) is continued until all of the entries in the stack have been popped out and processed. [00020]
  • the multi- parser system proceeds to the post processing stage shown in figure 2C.
  • the post-processing stage begins in block 250 where a probability score is assigned to each edge in the lattice table. Then based on this probability score information, a best path may be selected based on a Viterbi algorithm in block 255.
  • FIG. 4 is a table diagram showing the contents of a sample left corner condition (LCC) table, which may be used to rapidly identify which sub-grammar categories are applicable to a current word being analyzed.
  • the LCC table is constructed and stored, for example in data storage such as a hard drive or other storage device, prior to analyzing input sentences.
  • the LCC table has columns as follows: a sub-grammar name column 404 containing the names of the sub-grammar categories and a left corner condition column 406 containing a number of words, symbols and phrases (also referred to as "left corner condition") associated with the sub-grammar category.
  • the LCC table also contains a sub-grammar identification (GID) column, which contains a value uniquely identifying each row in the table that can be used to quickly invoke the sub-grammar parser associated with the corresponding row (i.e., sub-grammar category).
  • GID sub-grammar identification
  • the sub-grammar parser applicable to the current word can be determined by matching the current word being analyzed with those words (i.e., left corner conditions) listed each row of the LCC table.
  • the size of the LCC table has been severely constrained to provide a manageable example.
  • the actual LCC table may contain any number of sub-grammar categories (e.g., tens or even hundreds) and each sub-grammar category may contain any number of left comer conditions (e.g., hundreds or even thousands).
  • the noun phrase (NP) sub-grammar category includes words that are associated with a noun phrase. If the current word or phrase being analyzed is "the", the word “the” is looked up in this LCC table by comparing it with words listed under the left corner condition column. Since the word "the” is associated with a noun phrase (NP) sub-grammar category, the sub-grammar parser associated with noun phrase (NP) will be selected to parse the word or phrase beginning with "the”. It should be understood that there are certain words that may reside in more than one sub- grammar categories. For example, the word “copies" may be a noun in certain sentence context and may also be a verb is other context.
  • each word, phrase or symbol recognizable by the multi-parser system is assigned a unique symbol identification number (SID).
  • SID symbol identification number
  • the symbol table contains a symbol column 502 and a symbol identification (SID), which contains a value uniquely identifying each word, phrase or symbol associated with each row in the SID table that can be used to quickly locate and retrieve information from a lookup table shown in figure 7.
  • SID symbol identification
  • the word "man” is arbitrary assigned SID number "8”.
  • NP noun phrase
  • the noun phrase (NP) sub-grammar category may be associated with all words found in a dictionary that are in some ways associated with a noun phrase.
  • FIG. 6 is a table diagram showing a portion of a sample hash table, which may be used to determine an SID associated with a particular word, symbol or phrase.
  • the illustrated hash table has an index column 602 and a symbol column 604.
  • Each entry (i.e., row) in the hash table includes a symbol (e.g., a symbol, a word or a phrase) which is indexed according to a hash value.
  • the hash value associated with each symbol may be obtained by applying a hash function to the corresponding symbol.
  • the hash value of a symbol may be computed based on the American Standard Code for Information Interchange (ASCII) code of the symbol.
  • ASCII American Standard Code for Information Interchange
  • the hash table also contains a symbol identification (SID) column which contains a value uniquely identifying each symbol in the table to rapidly locate and examine the corresponding row in the lookup table of figure 7.
  • SID symbol identification
  • the SID specified in the hash table corresponds with the SID assigned to each symbol in the symbol table.
  • the hash table may be used to lookup the SID associated with a particular word by first computing the hash value of the particular word and then using the hash value to index through the hash table in order to obtain the proper SID.
  • Figure 7 is a table diagram showing a lookup table constructed in accordance with one embodiment of the invention.
  • the predictive cascading unit 102 is able to rapidly identify one or more sub-grammar categories applicable to a current word, phrase or symbol being analyzed.
  • the lookup table is constructed and stored in data storage prior to analyzing of input sentences.
  • the illustrated lookup table has a symbol column 702 containing symbols (e.g., symbols, words or phrases) recognizable by the multi-parser system, a symbol identification (SID) column 704 and a number of sub- grammar parser columns 706-712.
  • SID symbol identification
  • Each row in the lookup table is associated with a particular word that is indexed by the corresponding SID value. It should be appreciated the size of the lookup table has been severely constrained to provide a manageable example.
  • the actual lookup table may contain any number of symbols (e.g., thousands) and any number of sub- grammar parser columns (e.g., tens or hundreds).
  • the sub-grammar parser column 710 corresponds to a noun phrase (NP) sub-grammar parser and contains indications of whether the symbol associated with each of the lookup table is applicable to the sub- grammar parser (NP).
  • the columns 706, 708 and 710 correspond to a sentence (S) sub-grammar parser, a verb phrase (VP) sub-grammar parser and a prepositional phrase (PP) sub-grammar parser.
  • the eighth row of column 710 in the lookup table has a "man” in the symbol column and a "1" in the sub-grammar (NP) column, indicating that the sub-grammar parser (NP) is applicable to the symbol "man”.
  • the eighth row of column 710 has a "0" in the sub-grammar (VP) column 708, indicating that the sub- grammar parser (VP) is not applicable to the symbol "man”.
  • an index value e.g., SID
  • the word "man” has an index value "8" according to a sample hash table provided in figure 6.
  • the predictive cascading unit interrogates the entire row by traversing from one column to another column to identify one or more sub-grammar parser fields that contains "1".
  • the sub-grammar searching process consists of two major stages.
  • the first stage involves identifying an index value associated with the current word being analyzed. Then, by indexing through the lookup table using the index value derived, the sub- grammar categories applicable to the current word ma be rapidly identified. Once sub-grammar categories have been identified, the multi-parser system invokes only those sub-grammar parsers that correspond to the sub-grammar categories identified as being applicable.
  • the determination of the applicable sub-grammar parsers may be further expedited by indexing directly to the lookup table using a hash value or any other suitable indexing method. Accordingly, in this alternative embodiment, the symbols specified within the symbol column 702 of the lookup table shown in figure 7 will be rearranged such that they are indexed based on the hash value or other suitable indexing value.
  • the multi-parser system of the present invention is able to significantly expedite the process of determining which grammar rules may be applicable to a particular word or phrase.
  • the multi-parser system is able to rapidly eliminate majority of sub-grammar categories that are not applicable to a current word being analyzed, without having to go through the entire set of rules listed within individual sub-grammar categories. This means that the multi-parser system is able to focus its analysis only to those sub-grammar categories that have been identified as being relevant to the parsing of the current word without wasting time searching through all sub-grammar categories.
  • the parsing operations of the present invention will be described in reference to a lattice diagram shown in figure 8.
  • the lattice diagram includes nodes (Nl through N8) and edges (Ll through L17) between the nodes. Each edge represents a portion of the input sentence.
  • the input edges Ll through L7 correspond to each individual word in the input sentence.
  • the edges L8 through L17 correspond to the result edges generated by the sub- grammar parsers.
  • Each edge corresponds to a word or a portion of the input sentence and is used to indicate the part of speech, the grammatical structure or phrase structure of the corresponding portion thereof.
  • a cascading composition algorithm is employed in which the underlying syntactic structure of a sentence is determined from bottom-up.
  • individual words of the sentence are initially analyzed right-to-left and subsequently to intermediate phrases until the analysis of the sentence is complete.
  • each individual edges Ll though L7 will be placed in a stack in a reverse topological order such that the word "house” will be at the top of the stack. Accordingly, the word "house” will be popped out of the stack and will be designated as the current word being processed.
  • the predictive cascading unit 102 identifies which sub-grammar categories are associated with the current word.
  • the predictive cascading unit 102 is able to determine the SID associated with the current word "house”. Then by indexing through the lookup table (figure 7) using the SID derived, the predictive cascading unit 102 is able to identify which sub-grammar categories may be applicable by traversing across the ninth row; zero "0" means that that particular sub-grammar category is not applicable and one "1" means that that particular sub-grammar category may be applicable.
  • the word "house” is associated with the sub-grammar parser (NP).
  • a result edge L8 is derived.
  • the edge L8 extending from node N7 to N8 is labeled as a noun phrase (NP) because the word "house” met the requirement of one of the rules set forth in the sub- grammar parser (NP).
  • the predictive cascading unit 102 proceeds to parse the next word in the sentence.
  • a noun phrase (NP) sub-grammar parser will be invoked at the next location (input edge L6) in the input sentences.
  • the sub-grammar parser (NP) Once the sub-grammar parser (NP) is invoked, it will determine if the sequence of words "the house” is a valid noun phrase. Since the phrase "the house” is one of the valid noun phrases listed in the sub-grammar noun phrase (NP) shown in figure 3, it returns a result edge L9 labeled as a noun phrase to indicate that a determine "the” followed by the noun "house” has satisfied one of the rules specified in the sub-grammar (NP).
  • the lattice table 104 (shown in figure 9) and the stack 105 maintained by the predictive cascading unit 102 are updated with the new result edge. This parsing process is repeated until all entries in the stack have been processed. At the end, a single edge covering the entire sentence is derived.
  • the edge L17 covering the entire input sentence from node Nl to node N8 is derived at the end by the sub- grammar parser (S) and is labeled as a sentence (S) because the sentence (S) (the result edge L16) followed by a prepositional phrase (PP) (the result edge L10) meets the requirement of one of the rules set forth in the sub-grammar (S) shown in figure 3.
  • Figure 9 depicts a lattice table containing a list of edges, which correspond to the edges shown in the lattice diagram of figure 8.
  • each edge is associated with an edge name (e.g., NP, PP) and a set of nodes (i.e., a start node and an end node) defining the boundaries of the edge.
  • edge name e.g., NP, PP
  • nodes i.e., a start node and an end node
  • Each sub- grammar contains a set of grammar rules.
  • the rules contained in the sub- grammars enable the predictive cascading unit to determine if the current word and subsequent words or phrases can be combined into a phrase, such as noun phrases, prepositional phrases, verb phrases, and sentences. It should be understood that the number of rules associated with each sub-grammar shown in figure 3 has been severely constrained to provide a manageable example and the actual sub-grammars may contain any number of rules.
  • the grammar of the English language may be partitioned into any number of sub-grammars (e.g., tens or even hundreds). Each sub- grammar parser associated with a respective sub-grammar is configured to extract certain language feature.
  • one sub-grammar parser may be configured to extract portions of the input sentence that describe time
  • another sub-grammar parser may be configured to extract portions that describe dates. For example: "I went to California last year", the sub-grammar parser for time will extract "last year” and identify that portion of the sentence as time. Examples of other sub-grammar parsers include location, age, money amount, etc.
  • the sub-grammar sentence (S) contains two rules.
  • the first rule in the sub-grammar (S) indicates that a sentence can be composed of a noun phrase (NP) followed by a verb phrase (VP).
  • the second rule in the sub-grammar (S) indicates that a sentence (S) can also be formed by combining a sentence (S) with a prepositional phrase (PP).
  • the sub-grammar verb phase (VP) specifies that a verb phase may be composed of a verb such as "saw” or "ran” followed by a noun phrase (NP) or a prepositional phrase (PP).
  • sub- grammar noun phrase specifies that a noun phrase may be composed of a single word such as "I”, “he”, “man”, “house”, etc or may be composed of a combination of words such as a determiner "a” or “the” followed by a noun such as "man", "house” or “telescope”.
  • sub-grammar prepositional phrase specifies that a prepositional phrase (PP) may be composed of a preposition such as "in” or "with” followed by a noun phrase (NP).
  • the operations performed by the present invention may be embodied in the form of software program stored on a machine-readable medium, such as, but is not limited to, any type of disk including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.
  • ROMs read-only memories
  • RAMs random access memories
  • EPROMs electrically erasable programmable read-only memories
  • EEPROMs electrically erasable programmable read-only memory
  • magnetic or optical cards or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.
  • the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
  • the multi-parser system constructed according to the present invention may provide a significant advantage over the conventional parsers. At least some of the conventional parsers compare all individual rules defining the entire grammar with a portion of a sentence being analyzed to determine which rules are applicable. Consequently, the conventional parsing process can be time consuming since a grammar typically consists of a very large number of rules and the conven ional parser must search through the entire list of rules each time a different portion of the sentence is analyzed until the whole sentence is parsed.
  • the present invention by constructing a le t comer condition table that specifies the left corner conditions for individual sub-grammar parsers prior to analyzing of input sentences and using such left corner condition table for the invocation of applicable sub-grammar parsers at each position along the input sentence, the amount of time required to perform parsing operations is significantly reduced.
  • the present invention improves the performance of a multi-parser architecture which is an important component for various natural language understanding applications, including extraction and summarization, robust understanding systems.
  • Another advantage achievable by the present invention relates to easily incorporate a large number of new sub-grammar parsers into the multi-parser architecture system.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A system is described for parsing an input sentence. The system includes a set of sub-grammar parsers, each sub-grammar parser containing a set of grammar rules to parse a portion of a natural language grammar. The system further includes a subsystem to identify at least one of said sub-grammar parsers applicable to a portion of an input text being analyzed and to parse said portion using said at least one of said sub-grammar parsers identified as being applicable. In one embodiment, a table is used to facilitate the sub-grammar searching process. The table contains a number of sub-grammar categories, each sub-grammar category associated with a set of left corner conditions. One or more sub-grammar categories applicable to a current word being analyzed can be identified by comparing the current word being analyzed against the left corner conditions associated with individual sub-grammar categories.

Description

PREDICTIVE CASCADING ALGORITHM FOR MULTI-PARSER
ARCHITECTURE
BACKGROUND OF THE INVENΗON
Field of the Invention
[0001] The present invention generally relates to language processing, and in particular to multi-parser architecture for language processing.
Description of the Related Art
[0002] To process natural language sentences, a language parser may be used to derive a syntactic tree that shows syntactic relationship between the words and phrases in input text data, such as a sentence. The language parser is an important component in various natural language applications. Due to the complexity of natural language itself, the grammar rules that describe the natural language is also complex. The traditional way of using a single language parser that encompasses the entire grammar of a natural language may lead to an inefficient, inaccurate, and often very large language processing system.
[0003] With rapid increase in number of Internet users, the demand for usage of natural language applications in the Internet have increased in the recent past and will continue to increase. The problems associated with using a single language parser to analyze the entire grammar of a particular language have become more apparent with the increasing demand for natural language applications over the Internet. In general, the existing parsing techniques are not fast enough for natural language applications in certain environments such as the Internet.
BRIEF. DESCRIPTION OF THE DRAWINGS
[0004] Figure 1 is a block diagram of a multi-parser system for parsing natural language sentences according to one embodiment of the invention. [0005] Figures 2A-2C are flowcharts of parsing an input sentence according to one embodiment of the invention.
[0006] Figure 3 shows sub-grammars according to one embodiment of the invention. [0007] Figure 4 is a table diagram showing the contents of a left corner condition table constructed according to one embodiment of the invention.
[0008] Figure 5 is a table diagram showing the contents of a sample symbol table.
[0009] Figure 6 is a table diagram showing a portion of a sample hash table.
[00010] Figure 7 is table diagram showing a sample left corner condition table constructed according to another embodiment of the invention.
[00011] Figure 8 is a lattice diagram illustrating parsing of an example sentence according to one embodiment of the invention.
[00012] Figure 9 is a lattice table according to one embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[00013] In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order to avoid obscuring the present invention.
[00014] Figure 1 depicts a multi-parser system 100 for parsing natural language sentences according to one embodiment of the invention. The multi- parser system 100 includes a predictive cascading unit 102, a probability score assigning unit 110, a Viterbi unit 112, a left comer condition table 108 and a hash table 120. Also included in the multi-parser system 100 are a number of sub-grammar parsers 114-118. It should be noted that a grammar of a natural language (e.g., English, Chinese, etc.) usually consists of a large number of rules. As a result, the size of a single parser to handle the entire grammar may be too large for programming languages. To avoid problems associated with encompassing the entire grammar with a single parser, a grammar is partitioned into a number of sub-grammar categories. In this regard, different sub-grammar parsers 114-118 are configured to handle different sub-grammar categories of a grammar, such as sentence (S), noun phrase (NP), verb phrase (VP) and prepositional phrase (PP), etc. Each sub-grammar parser contains a set of grammar rules associated with the corresponding category of the grammar.
[00015] In one embodiment, the predictive cascading unit 102 is configured to invoke only those sub-grammar parsers at each location in the input sentence when it meets certain criteria, called left corner condition. If it meets one of the left corner conditions associated with a sub-grammar parser, it invokes that particular sub-grammar parser at that particular place. Otherwise, that sub-grammar parser will not be invoked. When a sub-grammar parser is invoked at certain location in the input sentence, the sub-grammar parser checks the validity of the sequence of the current word and subsequent word or phrase being analyzed. If the sub-grammar parser determines that such sequence is valid, it will return one or more result edges to indicate that such sequence has satisfied one or more rules specified therein. Once a result edge has been returned by a sub-grammar parser, the lattice table 104 and the stack 106 maintained by the predictive cascading unit 102 are updated. This parsing process is repeated until all of the words and phrases within the sentence have been examined. Once the parsing process has been completed, the probability score assigning unit 110 examines the lattice table 104 and assigns a probability score to individual result edges listed in the lattice table 104. Then, a syntactic structure of the input sentence may be determined by the Viterbi unit 112 by selecting a best path based the probability score assigned to the result edges. [00016] Figures 2A-2C depicts operations of the multi-parser system according to one embodiment of the invention. The parsing operations can be broken down into three stages, namely a pre-processing stage, a main processing stage and a post-processing stage. During the pre-processing stage and as shown in figure 2A, the predictive cascading unit 102 receives an input sentence from an external source in block 200. Then in block 205, the predictive cascading unit 102 constructs a lattice table containing a list of edges initially derived by assigning an input edge for each word of the input sentence. Additionally during the pre-processing stage, the input edges are sorted in reverse topological order and placed into a stack in block 210. [00017] Once the pre-processing stage has been completed, the multi- parser system proceeds to the main processing stage as shown in figure 2B. The main processing stage consists of a loop (blocks 215-245) to identify one or more sub-grammar parsers applicable to each edge contained in the stack and to parse the respective edge using the sub-grammar parsers identified as being applicable. The loop (blocks 215-245) processes each entry loaded in the stack individually starting from the highest entry. If the stack is not empty (block 215, no), the predictive cascading unit proceeds to block 220 where an edge from the top of the stack is popped out. The most recently popped out of the stack will be referred as a current edge.
[00018] As noted above, a grammar of a natural language typically consists of a large number of rules. The present invention provides a way to reduce the amount of time needed to identify grammar rules that may be applicable to the current edge. In one embodiment, a left corner condition table is used to identify those sub-grammar categories that may be applicable to the current word without having to search through the entire set of grammar rules.
[00019] Accordingly, once a current edge has been retrieved from the top of the stack, the predictive cascading unit identifies one or more sub-grammar categories applicable to the current edge based on the left corner condition table in block 225. Then in block 230, the multi-parser system parses the current edge by invoking the sub-gram ar parsers associated with the sub- grammar categories identified as being applicable. If the sub-grammar parser produces a result edge (block 235, yes), then multi-parser system takes the result edge and updates the stack by pushing the result edge into the stack in block 240. In addition to updating the stack, the multi-parser system also takes the result edge derived from the respective sub-grammar parser and adds it to the lattice table in block 245. This loop (blocks 215-245) is continued until all of the entries in the stack have been popped out and processed. [00020] When the main processing stage has been completed, the multi- parser system proceeds to the post processing stage shown in figure 2C. The post-processing stage begins in block 250 where a probability score is assigned to each edge in the lattice table. Then based on this probability score information, a best path may be selected based on a Viterbi algorithm in block 255. Finally in block 260, a syntatic tree is derived based on the best path selected by the Viterbi unit. [00021] Figure 4 is a table diagram showing the contents of a sample left corner condition (LCC) table, which may be used to rapidly identify which sub-grammar categories are applicable to a current word being analyzed. The LCC table is constructed and stored, for example in data storage such as a hard drive or other storage device, prior to analyzing input sentences. The LCC table has columns as follows: a sub-grammar name column 404 containing the names of the sub-grammar categories and a left corner condition column 406 containing a number of words, symbols and phrases (also referred to as "left corner condition") associated with the sub-grammar category. The LCC table also contains a sub-grammar identification (GID) column, which contains a value uniquely identifying each row in the table that can be used to quickly invoke the sub-grammar parser associated with the corresponding row (i.e., sub-grammar category). In use, the sub-grammar parser applicable to the current word can be determined by matching the current word being analyzed with those words (i.e., left corner conditions) listed each row of the LCC table. It should be appreciated the size of the LCC table has been severely constrained to provide a manageable example. The actual LCC table may contain any number of sub-grammar categories (e.g., tens or even hundreds) and each sub-grammar category may contain any number of left comer conditions (e.g., hundreds or even thousands).
[00022] For example, the noun phrase (NP) sub-grammar category includes words that are associated with a noun phrase. If the current word or phrase being analyzed is "the", the word "the" is looked up in this LCC table by comparing it with words listed under the left corner condition column. Since the word "the" is associated with a noun phrase (NP) sub-grammar category, the sub-grammar parser associated with noun phrase (NP) will be selected to parse the word or phrase beginning with "the". It should be understood that there are certain words that may reside in more than one sub- grammar categories. For example, the word "copies" may be a noun in certain sentence context and may also be a verb is other context. In cases where the current word is associated with more than one sub-grammar categories, all corresponding sub-grammar parsers will be invoked to parse the current word. Consequently, there may be more than one result edge produced during parsing of a particular word or phrase. [00023] In one embodiment, to simplify the process of identifying sub- grammar categories that may be applicable to the current word being analyzed, each word, phrase or symbol recognizable by the multi-parser system is assigned a unique symbol identification number (SID). Figure 5 is a table diagram shown a symbol table. The symbol table contains a symbol column 502 and a symbol identification (SID), which contains a value uniquely identifying each word, phrase or symbol associated with each row in the SID table that can be used to quickly locate and retrieve information from a lookup table shown in figure 7. For example, in the sample symbol table shown in figure 5, the word "man" is arbitrary assigned SID number "8". [00024] As noted above, at least some of the sub-grammar categories listed in an actual LCC table will be associated with a large number of words, phrases and symbols. For example, the noun phrase (NP) sub-grammar category may be associated with all words found in a dictionary that are in some ways associated with a noun phrase. Consequently, in order to determine whether or not a particular word is associated with a particular sub- grammar category, the current word being analyzed must be compared against the entire list of words associated with each individual sub-grammar categories. In this regard, the present invention incorporates a search technique to reduce the amount of time and processing necessary to identify those sub-grammar categories listed in the LCC table that may be associated with the current word. In one embodiment, a hash table is utilized to expedite the process of searching through the left corner condition table. [00025] Figure 6 is a table diagram showing a portion of a sample hash table, which may be used to determine an SID associated with a particular word, symbol or phrase. The illustrated hash table has an index column 602 and a symbol column 604. Each entry (i.e., row) in the hash table includes a symbol (e.g., a symbol, a word or a phrase) which is indexed according to a hash value. The hash value associated with each symbol may be obtained by applying a hash function to the corresponding symbol. For example, the hash value of a symbol may be computed based on the American Standard Code for Information Interchange (ASCII) code of the symbol. In this regard, since the ASCII code for the word "man" is "10997110", the hash value or index of the word "man" is 1866 [i.e., index = (Q09*31+97)*31+110)mod2000 = 1866]. The hash table also contains a symbol identification (SID) column which contains a value uniquely identifying each symbol in the table to rapidly locate and examine the corresponding row in the lookup table of figure 7. In one embodiment, the SID specified in the hash table corresponds with the SID assigned to each symbol in the symbol table. In this regard, the hash table may be used to lookup the SID associated with a particular word by first computing the hash value of the particular word and then using the hash value to index through the hash table in order to obtain the proper SID. [00026] Figure 7 is a table diagram showing a lookup table constructed in accordance with one embodiment of the invention. By using the SID obtained from the hash table to index through the lookup table, the predictive cascading unit 102 is able to rapidly identify one or more sub-grammar categories applicable to a current word, phrase or symbol being analyzed. The lookup table is constructed and stored in data storage prior to analyzing of input sentences. The illustrated lookup table has a symbol column 702 containing symbols (e.g., symbols, words or phrases) recognizable by the multi-parser system, a symbol identification (SID) column 704 and a number of sub- grammar parser columns 706-712. Each row in the lookup table is associated with a particular word that is indexed by the corresponding SID value. It should be appreciated the size of the lookup table has been severely constrained to provide a manageable example. The actual lookup table may contain any number of symbols (e.g., thousands) and any number of sub- grammar parser columns (e.g., tens or hundreds). [00027] The sub-grammar parser column 710 corresponds to a noun phrase (NP) sub-grammar parser and contains indications of whether the symbol associated with each of the lookup table is applicable to the sub- grammar parser (NP). Similarly, the columns 706, 708 and 710 correspond to a sentence (S) sub-grammar parser, a verb phrase (VP) sub-grammar parser and a prepositional phrase (PP) sub-grammar parser. For example, the eighth row of column 710 in the lookup table has a "man" in the symbol column and a "1" in the sub-grammar (NP) column, indicating that the sub-grammar parser (NP) is applicable to the symbol "man". In contrast, the eighth row of column 710 has a "0" in the sub-grammar (VP) column 708, indicating that the sub- grammar parser (VP) is not applicable to the symbol "man". In use, an index value (e.g., SID) associated with a particular word may be derived by using a hash table or any other suitable indexing method. For example, the word "man" has an index value "8" according to a sample hash table provided in figure 6. Then, by using this index value "8" to index through the lookup table, the word "man" can be easily located. Once a particular word has been located in the lookup table by using a proper index value, the predictive cascading unit interrogates the entire row by traversing from one column to another column to identify one or more sub-grammar parser fields that contains "1".
[00028] In this illustrated embodiment, the sub-grammar searching process consists of two major stages. The first stage involves identifying an index value associated with the current word being analyzed. Then, by indexing through the lookup table using the index value derived, the sub- grammar categories applicable to the current word ma be rapidly identified. Once sub-grammar categories have been identified, the multi-parser system invokes only those sub-grammar parsers that correspond to the sub-grammar categories identified as being applicable.
[00029] In an alternative embodiment, the determination of the applicable sub-grammar parsers may be further expedited by indexing directly to the lookup table using a hash value or any other suitable indexing method. Accordingly, in this alternative embodiment, the symbols specified within the symbol column 702 of the lookup table shown in figure 7 will be rearranged such that they are indexed based on the hash value or other suitable indexing value.
[00030] Advantageously, by using the left corner condition table shown in figure 4 or the lookup table shown in figure 7, the multi-parser system of the present invention is able to significantly expedite the process of determining which grammar rules may be applicable to a particular word or phrase. According to one embodiment of the invention, the multi-parser system is able to rapidly eliminate majority of sub-grammar categories that are not applicable to a current word being analyzed, without having to go through the entire set of rules listed within individual sub-grammar categories. This means that the multi-parser system is able to focus its analysis only to those sub-grammar categories that have been identified as being relevant to the parsing of the current word without wasting time searching through all sub-grammar categories.
[00031] The parsing operations of the present invention will be described in reference to a lattice diagram shown in figure 8. In tine example shown in figure 8, an input sentence "I saw a man in the house" is provided. The lattice diagram includes nodes (Nl through N8) and edges (Ll through L17) between the nodes. Each edge represents a portion of the input sentence. The input edges Ll through L7 correspond to each individual word in the input sentence. The edges L8 through L17 correspond to the result edges generated by the sub- grammar parsers. Each edge corresponds to a word or a portion of the input sentence and is used to indicate the part of speech, the grammatical structure or phrase structure of the corresponding portion thereof. [00032] In the illustrated embodiment, a cascading composition algorithm is employed in which the underlying syntactic structure of a sentence is determined from bottom-up. In this embodiment, individual words of the sentence are initially analyzed right-to-left and subsequently to intermediate phrases until the analysis of the sentence is complete. [00033] Initially, after the input sentence has been received, each individual edges Ll though L7 will be placed in a stack in a reverse topological order such that the word "house" will be at the top of the stack. Accordingly, the word "house" will be popped out of the stack and will be designated as the current word being processed. Then, the predictive cascading unit 102 identifies which sub-grammar categories are associated with the current word. First the hash value associated with the current word "house" is computed. Then using the hash value to index through the hash table shown in figure 6, the predictive cascading unit 102 is able to determine the SID associated with the current word "house". Then by indexing through the lookup table (figure 7) using the SID derived, the predictive cascading unit 102 is able to identify which sub-grammar categories may be applicable by traversing across the ninth row; zero "0" means that that particular sub-grammar category is not applicable and one "1" means that that particular sub-grammar category may be applicable. According to the lookup table, the word "house" is associated with the sub-grammar parser (NP). Once the sub-grammar parser (NP) is invoked to parser the word "house", a result edge L8 is derived. The edge L8 extending from node N7 to N8 is labeled as a noun phrase (NP) because the word "house" met the requirement of one of the rules set forth in the sub- grammar parser (NP).
[00034] The predictive cascading unit 102 proceeds to parse the next word in the sentence. In this example, a noun phrase (NP) sub-grammar parser will be invoked at the next location (input edge L6) in the input sentences. Once the sub-grammar parser (NP) is invoked, it will determine if the sequence of words "the house" is a valid noun phrase. Since the phrase "the house" is one of the valid noun phrases listed in the sub-grammar noun phrase (NP) shown in figure 3, it returns a result edge L9 labeled as a noun phrase to indicate that a determine "the" followed by the noun "house" has satisfied one of the rules specified in the sub-grammar (NP). Once the result edge L9 extending from node 6 to node 8 is labeled as a noun phrase (NP) has been returned by the sub-grammar parser (NP), the lattice table 104 (shown in figure 9) and the stack 105 maintained by the predictive cascading unit 102 are updated with the new result edge. This parsing process is repeated until all entries in the stack have been processed. At the end, a single edge covering the entire sentence is derived. In this example, the edge L17 covering the entire input sentence from node Nl to node N8 is derived at the end by the sub- grammar parser (S) and is labeled as a sentence (S) because the sentence (S) (the result edge L16) followed by a prepositional phrase (PP) (the result edge L10) meets the requirement of one of the rules set forth in the sub-grammar (S) shown in figure 3.
[00035] Figure 9 depicts a lattice table containing a list of edges, which correspond to the edges shown in the lattice diagram of figure 8. In the illustrated lattice table, each edge is associated with an edge name (e.g., NP, PP) and a set of nodes (i.e., a start node and an end node) defining the boundaries of the edge.
[00036] Examples of sub-grammars are shown in figure 3. Each sub- grammar contains a set of grammar rules. The rules contained in the sub- grammars enable the predictive cascading unit to determine if the current word and subsequent words or phrases can be combined into a phrase, such as noun phrases, prepositional phrases, verb phrases, and sentences. It should be understood that the number of rules associated with each sub-grammar shown in figure 3 has been severely constrained to provide a manageable example and the actual sub-grammars may contain any number of rules. [00037] The grammar of the English language may be partitioned into any number of sub-grammars (e.g., tens or even hundreds). Each sub- grammar parser associated with a respective sub-grammar is configured to extract certain language feature. For example, one sub-grammar parser may be configured to extract portions of the input sentence that describe time, another sub-grammar parser may be configured to extract portions that describe dates. For example: "I went to California last year", the sub-grammar parser for time will extract "last year" and identify that portion of the sentence as time. Examples of other sub-grammar parsers include location, age, money amount, etc.
[00038] As seen by referring to figure 3, the sub-grammar sentence (S) contains two rules. The first rule in the sub-grammar (S) indicates that a sentence can be composed of a noun phrase (NP) followed by a verb phrase (VP). Additionally, the second rule in the sub-grammar (S) indicates that a sentence (S) can also be formed by combining a sentence (S) with a prepositional phrase (PP). Next, the sub-grammar verb phase (VP) specifies that a verb phase may be composed of a verb such as "saw" or "ran" followed by a noun phrase (NP) or a prepositional phrase (PP). Similarly, the sub- grammar noun phrase (NP) specifies that a noun phrase may be composed of a single word such as "I", "he", "man", "house", etc or may be composed of a combination of words such as a determiner "a" or "the" followed by a noun such as "man", "house" or "telescope". Finally, the sub-grammar prepositional phrase (PP) specifies that a prepositional phrase (PP) may be composed of a preposition such as "in" or "with" followed by a noun phrase (NP). [00039] The operations performed by the present invention may be embodied in the form of software program stored on a machine-readable medium, such as, but is not limited to, any type of disk including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Moreover, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
[00040] The multi-parser system constructed according to the present invention may provide a significant advantage over the conventional parsers. At least some of the conventional parsers compare all individual rules defining the entire grammar with a portion of a sentence being analyzed to determine which rules are applicable. Consequently, the conventional parsing process can be time consuming since a grammar typically consists of a very large number of rules and the conven ional parser must search through the entire list of rules each time a different portion of the sentence is analyzed until the whole sentence is parsed. According to one embodiment of the invention, by constructing a le t comer condition table that specifies the left corner conditions for individual sub-grammar parsers prior to analyzing of input sentences and using such left corner condition table for the invocation of applicable sub-grammar parsers at each position along the input sentence, the amount of time required to perform parsing operations is significantly reduced. As a result, the present invention improves the performance of a multi-parser architecture which is an important component for various natural language understanding applications, including extraction and summarization, robust understanding systems. Another advantage achievable by the present invention relates to easily incorporate a large number of new sub-grammar parsers into the multi-parser architecture system.
[00041] While the foregoing embodiments of the invention have been described and shown, it is understood that variations and modifications, such as those suggested and others within the spirit and scope of the invention, may occur to those skilled in the art to which the invention pertains. For example, it should be noted that sub-grammar category searching technique taught by the present invention can be applied to parsing of any natural language and is not limited to English language. The scope of the present invention accordingly is to be defined as set forth in the appended claims.

Claims

CLAIMS What is claimed is:
1. A system comprising: a set of sub-grammar parsers, each sub-grammar parser containing a set of grammar rules to parse a portion of a natural language grammar; and a subsystem to identify at least one of said sub-grammar parsers applicable to a portion of an input text being analyzed and to parse said portion of said input text using said at least one of said sub-grammar parsers identified as being applicable.
2. The system of claim 1, further comprising a left corner condition table containing a plurality of sub-grammar categories, each sub-grammar category associated with a set of left comer conditions, wherein a sub-grammar category applicable to a current word being analyzed is determinable by comparing the current word being analyzed against said set of left corner conditions associated with said sub-grammar categories.
3. The system of claim 2, wherein said left corner condition table comprises a sub-grammar name column containing names of sub-grammar categories and a left corner condition column containing words, symbols and phrases associated with individual sub-grammar categories.
4. The system of claim 1, wherein said subsystem identifies said at least one of said sub-grammar parsers application to the portion of the input text being analyzed by computing an index value associated with the first word of the portion being analyzed and using the index value associated with the current word to lookup a data indicating which sub-grammar parsers are applicable to the first word of the portion being analyzed.
5. The system of claim 1, wherein said subsystem identifies said at least one of said stib-grammar parsers by computing an index value associated with a current word being analyzed and indexing through a lookup table using the index value computed to determine said at least one of said sub-grammar parsers applicable to the portion of the input text being analyzed.
6. The system of claim 5, wherein said lookup table comprises a plurality of rows, each row associated with a symbol indexed according to an indexing scheme, and a plurality of columns, each column associated with a sub-grammar parser.
7. The system of claim 5, wherein said subsystem uses a hash table to compute an index value associated with a current word, said hash table having words which are individually indexed according to a hash value, each word contained within the hash table is associated with a respective index value.
8. The system of claim 1, further comprising: a first module to assign a probability score to individual result edges derived by said sub-grammar parsers; and a second module to determine a syntactic structure of said input text based on the result edges and tlαe probability score assigned to the result edges.
9. A method comprising: analyzing a portion of an input sentence; identifying a sub-grammar category which is applicable to said portion being analyzed; and parsing said portion tising a sub-grammar parser corresponding to said sub-grammar category identified as being applicable.
10. The method of claim 9, wherein said identifying a sub-grammar category comprises: accessing a table containing a plurality of sub-grammar categories, each sub-grammar category associated with a set of words; and comparing a current word being analyzed against the set of words associated with the sub-grammar categories to determine which sub-grammar category is applicable to the current word being analyzed, wherein said current word is the first word of the portion of the input sentence being analyzed.
1.4
11. The method of claim 9, wherein said identifying a sub-grammar category comprises: computing an index value associated with a current word being analyzed, wherein said current word is the first word of the portion of the input sentence being analyzed; and using the index value associated with the current word to lookup a data indicating which sub-grammar parser is applicable to the current word being analyzed.
12. The method of claim 9, wherein said identifying a sub-grammar category comprises: computing an index value associated with a current word being analyzed, wherein said current word is the first word of the portion of the inpxtt sentence being analyzed; and indexing through a lookup table using the index value computed to determine which one of said sub-grammar categories is applicable to the current word being analyzed.
13. The method of claim 12, wherein said index value associated with the current word is derived by using a hash table having words, each individual word indexed according to a hash value and associated with a unique index value.
14. The method of claim 9, further comprising: assigning a probability score to individual result edges derived from parsing; and determining a syntactic structure of said input sentence based on the result edges and the probability score assigned to the result edges.
15. The method comprising: assigning an input edge for each word in an input sentence; constructing a stack by placing the edges in a defined order; analyzing each edge in the stack individually; identifying a sub-grammar category is applicable to a current edge being analyzed; and parsing the current edge using a sub-grammar parser corresponding to said sub-grammar category identified as being applicable.
16. The method of claim 15, further comprising: updating the stack with result edges derived from parsing; assigning a probability score to the result edges derived from parsing; and determining a syntactic structure of said input sentence based on the result edges and the probability score assigned to the result edges.
17. The method of claim 15, wherein said identifying a sub-grammar category comprises: accessing a table containing a plurality of sub-grammar categories, each sub-grammar category associated with a set of words; and comparing a current word being analyzed against the set of words associated with the sub-grammar categories to determine which sub-grammar category is applicable to the current word being analyzed.
18. The method of claim 15, wherein said identifying a sub-grammar category comprises: computing an index value associated with a current word being analyzed; and indexing through a lookup table using the index value computed to determine which one of said sub-grammar categories is applicable to the current word being analyzed.
19. The method of claim 18, wherein said index value associated with the current word is derived by using a hash table having words, each individual word indexed according to a hash value and associated with a unique index value.
20. A machine-readable medium that provides instructions, which when executed by a processor, cause said processor to perform operations comprising: analyzing a portion of an input sentence; identifying a sub-grammar category which is applicable to said portion being analyzed; and parsing said portion using a sub-grammar parser corresponding to said sub-grammar category identified as being applicable.
21. The machine-readable medium of claim 20, wherein said operation of identifying a sub-grammar category comprises: accessing a table containing a plurality of sub-grammar categories, each sub-grammar category associated with a set of words; and comparing a current word being analyzed against the set of words associated with the sub-grammar categories to determine which sub-grammar category is applicable to the current word being analyzed.
22. The machine-readable medium of claim 20, wherein said operation of identifying a sub-grammar category comprises: computing an index value associated with a current word being analyzed; and indexing through a lookup table using the index value computed to determine which one of said sub-grammar categories is applicable to the current word being analyzed.
23. The machine-readable medium of claim 22, wherein said index value associated with the current word is derived by using a hash table having words, each individual word indexed according to a hash value and associated with a unique index value.
24. The machine-readable medium of claim 20, wherein said operations further comprises: assigning a probability score to individual result edges derived from parsing; and determining a syntactic structure of said input sentence based on the result edges and the probability score assigned to the result edges.
PCT/CN2001/001065 2001-06-29 2001-06-29 Predictive cascading algorithm for multi-parser architecture WO2003003241A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2001/001065 WO2003003241A1 (en) 2001-06-29 2001-06-29 Predictive cascading algorithm for multi-parser architecture
CNA018235743A CN1545665A (en) 2001-06-29 2001-06-29 Predictive cascading algorithm for multi-parser architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2001/001065 WO2003003241A1 (en) 2001-06-29 2001-06-29 Predictive cascading algorithm for multi-parser architecture

Publications (1)

Publication Number Publication Date
WO2003003241A1 true WO2003003241A1 (en) 2003-01-09

Family

ID=4574822

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2001/001065 WO2003003241A1 (en) 2001-06-29 2001-06-29 Predictive cascading algorithm for multi-parser architecture

Country Status (2)

Country Link
CN (1) CN1545665A (en)
WO (1) WO2003003241A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2419432A (en) * 2004-10-20 2006-04-26 Ibm A method and system for creating hierarchical classifiers of software components in natural language processing
US8315874B2 (en) 2005-12-30 2012-11-20 Microsoft Corporation Voice user interface authoring tool
WO2016003644A1 (en) * 2014-06-30 2016-01-07 Quixey, Inc. Query understanding pipeline
US9658999B2 (en) 2013-03-01 2017-05-23 Sony Corporation Language processing method and electronic device
US11275892B2 (en) 2019-04-29 2022-03-15 International Business Machines Corporation Traversal-based sentence span judgements

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8301993B2 (en) 2007-02-08 2012-10-30 International Business Machines Corporation Mapping raw spreadsheet data into graphs on arbitrary subsets of the data
US8117530B2 (en) 2007-02-19 2012-02-14 International Business Machines Corporation Extensible markup language parsing using multiple XML parsers
US11106872B2 (en) * 2018-01-09 2021-08-31 Jyu-Fang Yu System and method for improving sentence diagram construction and analysis by enabling a user positioning sentence construction components and words on a diagramming interface

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999041680A2 (en) * 1998-02-13 1999-08-19 Microsoft Corporation Segmentation of chinese text into words
WO2000011576A1 (en) * 1998-08-24 2000-03-02 Virtual Research Associates, Inc. Natural language sentence parser
GB2348522A (en) * 1998-11-10 2000-10-04 Ibm Automatically generating computer language grammars in browsable form

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999041680A2 (en) * 1998-02-13 1999-08-19 Microsoft Corporation Segmentation of chinese text into words
WO2000011576A1 (en) * 1998-08-24 2000-03-02 Virtual Research Associates, Inc. Natural language sentence parser
GB2348522A (en) * 1998-11-10 2000-10-04 Ibm Automatically generating computer language grammars in browsable form

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2419432A (en) * 2004-10-20 2006-04-26 Ibm A method and system for creating hierarchical classifiers of software components in natural language processing
US7657495B2 (en) 2004-10-20 2010-02-02 International Business Machines Corporation Method and system for creating hierarchical classifiers of software components to identify meaning for words with multiple meanings
US8315874B2 (en) 2005-12-30 2012-11-20 Microsoft Corporation Voice user interface authoring tool
US9658999B2 (en) 2013-03-01 2017-05-23 Sony Corporation Language processing method and electronic device
WO2016003644A1 (en) * 2014-06-30 2016-01-07 Quixey, Inc. Query understanding pipeline
US9747365B2 (en) 2014-06-30 2017-08-29 Quixey, Inc. Query understanding pipeline
US11275892B2 (en) 2019-04-29 2022-03-15 International Business Machines Corporation Traversal-based sentence span judgements

Also Published As

Publication number Publication date
CN1545665A (en) 2004-11-10

Similar Documents

Publication Publication Date Title
US5890103A (en) Method and apparatus for improved tokenization of natural language text
US5680628A (en) Method and apparatus for automated search and retrieval process
US6816830B1 (en) Finite state data structures with paths representing paired strings of tags and tag combinations
Novichkova et al. MedScan, a natural language processing engine for MEDLINE abstracts
JP4413349B2 (en) Sample text based language identification method and computer system
Nelken et al. Arabic diacritization using weighted finite-state transducers
US8938384B2 (en) Language identification for documents containing multiple languages
JP4024861B2 (en) Natural language parser with dictionary-based part-of-speech probabilities
US5752052A (en) Method and system for bootstrapping statistical processing into a rule-based natural language parser
US20030125928A1 (en) Method for retrieving similar sentence in translation aid system
US20060253273A1 (en) Information extraction using a trainable grammar
US20060277028A1 (en) Training a statistical parser on noisy data by filtering
US7236925B2 (en) Left-corner chart parsing
US20030217066A1 (en) System and methods for character string vector generation
CA2366485C (en) System and method for parsing a document
JP2002215619A (en) Translation sentence extracting method from translated document
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
JP2013502643A (en) Structured data translation apparatus, system and method
EP3598321A1 (en) Method for parsing natural language text with constituent construction links
US7398210B2 (en) System and method for performing analysis on word variants
US8041556B2 (en) Chinese to english translation tool
US7684975B2 (en) Morphological analyzer, natural language processor, morphological analysis method and program
WO2003003241A1 (en) Predictive cascading algorithm for multi-parser architecture
Khoo et al. Using statistical and contextual information to identify two‐and three‐character words in Chinese text
JP3139624B2 (en) Morphological analyzer

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 20018235743

Country of ref document: CN

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP