PREDICTIVE CASCADING ALGORITHM FOR MULTI-PARSER ARCHITECTURE
BACKGROUND OF THE INVENTION
Field of the Invention
[0001] The present invention generally relates to language processing, and in particular to a multi-parser architecture for language processing.
Description of the Related Art
[0002] To process natural language sentences, a language parser may be used to derive a syntactic tree that shows the syntactic relationships between the words and phrases in input text data, such as a sentence. The language parser is an important component in various natural language applications. Due to the complexity of natural language itself, the grammar rules that describe a natural language are also complex. The traditional approach of using a single language parser that encompasses the entire grammar of a natural language may lead to an inefficient, inaccurate, and often very large language processing system.
[0003] With the rapid increase in the number of Internet users, the demand for natural language applications on the Internet has increased in the recent past and will continue to increase. The problems associated with using a single language parser to analyze the entire grammar of a particular language have become more apparent with the increasing demand for natural language applications over the Internet. In general, the existing parsing techniques are not fast enough for natural language applications in certain environments such as the Internet.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Figure 1 is a block diagram of a multi-parser system for parsing natural language sentences according to one embodiment of the invention.
[0005] Figures 2A-2C are flowcharts of parsing an input sentence according to one embodiment of the invention.
[0006] Figure 3 shows sub-grammars according to one embodiment of the invention.
[0007] Figure 4 is a table diagram showing the contents of a left corner condition table constructed according to one embodiment of the invention.
[0008] Figure 5 is a table diagram showing the contents of a sample symbol table.
[0009] Figure 6 is a table diagram showing a portion of a sample hash table.
[00010] Figure 7 is a table diagram showing a sample left corner condition table constructed according to another embodiment of the invention.
[00011] Figure 8 is a lattice diagram illustrating parsing of an example sentence according to one embodiment of the invention.
[00012] Figure 9 is a lattice table according to one embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[00013] In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order to avoid obscuring the present invention.
[00014] Figure 1 depicts a multi-parser system 100 for parsing natural language sentences according to one embodiment of the invention. The multi-parser system 100 includes a predictive cascading unit 102, a probability score assigning unit 110, a Viterbi unit 112, a left corner condition table 108 and a hash table 120. Also included in the multi-parser system 100 are a number of sub-grammar parsers 114-118. It should be noted that a grammar of a natural language (e.g., English, Chinese, etc.) usually consists of a large number of rules. As a result, a single parser that handles the entire grammar may be too large to implement practically in a programming language. To avoid the problems associated with encompassing the entire grammar in a single parser, the grammar is partitioned into a number of sub-grammar categories. In this regard, different sub-grammar parsers 114-118 are configured to handle different sub-grammar categories of a grammar, such as sentence (S), noun phrase (NP), verb phrase (VP) and prepositional phrase (PP). Each sub-grammar parser contains a set of grammar rules associated with the corresponding category of the grammar.
[00015] In one embodiment, the predictive cascading unit 102 is configured to invoke, at each location in the input sentence, only those sub-grammar parsers whose criteria, called left corner conditions, are met at that location. If the current location meets one of the left corner conditions associated with a sub-grammar parser, the predictive cascading unit 102 invokes that particular sub-grammar parser at that location. Otherwise, that sub-grammar parser is not invoked. When a sub-grammar parser is invoked at a certain location in the input sentence, the sub-grammar parser checks the validity of the sequence formed by the current word and the subsequent word or phrase being analyzed. If the sub-grammar parser determines that the sequence is valid, it returns one or more result edges to indicate that the sequence has satisfied one or more rules specified therein. Once a result edge has been returned by a sub-grammar parser, the lattice table 104 and the stack 106 maintained by the predictive cascading unit 102 are updated. This parsing process is repeated until all of the words and phrases within the sentence have been examined. Once the parsing process has been completed, the probability score assigning unit 110 examines the lattice table 104 and assigns a probability score to the individual result edges listed in the lattice table 104. Then, a syntactic structure of the input sentence may be determined by the Viterbi unit 112 by selecting a best path based on the probability scores assigned to the result edges.
[00016] Figures 2A-2C depict operations of the multi-parser system according to one embodiment of the invention. The parsing operations can be broken down into three stages, namely a pre-processing stage, a main processing stage and a post-processing stage. During the pre-processing stage, as shown in figure 2A, the predictive cascading unit 102 receives an input sentence from an external source in block 200. Then in block 205, the predictive cascading unit 102 constructs a lattice table containing a list of edges initially derived by assigning an input edge to each word of the input sentence. Additionally during the pre-processing stage, the input edges are sorted in reverse topological order and placed into a stack in block 210.
[00017] Once the pre-processing stage has been completed, the multi-parser system proceeds to the main processing stage as shown in figure 2B. The main processing stage consists of a loop (blocks 215-245) that identifies one or more sub-grammar parsers applicable to each edge contained in the stack and parses the respective edge using the sub-grammar parsers identified as being applicable. The loop (blocks 215-245) processes each entry loaded in the stack individually, starting from the top entry. If the stack is not empty (block 215, no), the predictive cascading unit proceeds to block 220 where an edge is popped from the top of the stack. The edge most recently popped from the stack is referred to as the current edge.
[00018] As noted above, a grammar of a natural language typically consists of a large number of rules. The present invention provides a way to reduce the amount of time needed to identify grammar rules that may be applicable to the current edge. In one embodiment, a left corner condition table is used to identify those sub-grammar categories that may be applicable to the current word without having to search through the entire set of grammar rules.
[00019] Accordingly, once a current edge has been retrieved from the top of the stack, the predictive cascading unit identifies one or more sub-grammar categories applicable to the current edge based on the left corner condition table in block 225. Then in block 230, the multi-parser system parses the current edge by invoking the sub-grammar parsers associated with the sub-grammar categories identified as being applicable. If a sub-grammar parser produces a result edge (block 235, yes), then the multi-parser system takes the result edge and updates the stack by pushing the result edge onto the stack in block 240. In addition to updating the stack, the multi-parser system also takes the result edge derived from the respective sub-grammar parser and adds it to the lattice table in block 245. This loop (blocks 215-245) continues until all of the entries in the stack have been popped and processed.
[00020] When the main processing stage has been completed, the multi-parser system proceeds to the post-processing stage shown in figure 2C. The post-processing stage begins in block 250, where a probability score is assigned to each edge in the lattice table. Then, based on this probability score information, a best path may be selected using a Viterbi algorithm in block 255. Finally, in block 260, a syntactic tree is derived based on the best path selected by the Viterbi unit.
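By way of illustration only, the pre-processing stage and the main processing loop described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the names Edge, SUB_GRAMMARS, LCC, run_sub_parser and parse are hypothetical, the toy rules are modeled on the description of figure 3, the left corner conditions are simply taken to be the first symbol of each rule, and the post-processing stage (probability scoring and Viterbi selection) is omitted.

```python
from collections import namedtuple

# Hypothetical edge record: a label plus the start and end nodes it spans.
Edge = namedtuple("Edge", "label start end")

# Toy sub-grammars modeled on the description of figure 3 (assumption:
# actual sub-grammars would contain far more rules).
DETS, NOUNS = ("a", "the"), ("man", "house", "telescope")
SUB_GRAMMARS = {
    "S": [("NP", "VP"), ("S", "PP")],
    "VP": [("saw", "NP"), ("saw", "PP"), ("ran", "NP"), ("ran", "PP")],
    "NP": [("I",), ("he",), ("man",), ("house",)]
          + [(d, n) for d in DETS for n in NOUNS],
    "PP": [("in", "NP"), ("with", "NP")],
}

# Left corner conditions, built once before any sentence is parsed.
# Assumption: the left corner of a rule is its first symbol.
LCC = {}
for category, rules in SUB_GRAMMARS.items():
    for rule in rules:
        LCC.setdefault(rule[0], set()).add(category)


def run_sub_parser(category, edge, lattice):
    """Return result edges for rules of `category` that begin at `edge`."""
    results = []
    for rule in SUB_GRAMMARS[category]:
        if rule[0] != edge.label:
            continue
        ends = {edge.end}
        # Extend the match through edges already present in the lattice.
        for symbol in rule[1:]:
            ends = {e.end for e in lattice
                    if e.label == symbol and e.start in ends}
        results.extend(Edge(category, edge.start, end) for end in ends)
    return results


def parse(sentence):
    words = sentence.split()
    # Pre-processing: one input edge per word; nodes are numbered from 1.
    lattice = [Edge(w, i + 1, i + 2) for i, w in enumerate(words)]
    stack = list(lattice)  # the last word ends up on top of the stack
    # Main processing: pop an edge, invoke only the applicable sub-parsers.
    while stack:
        current = stack.pop()
        for category in sorted(LCC.get(current.label, ())):
            for result in run_sub_parser(category, current, lattice):
                if result not in lattice:
                    lattice.append(result)  # update the lattice table
                    stack.append(result)    # update the stack
    return lattice
```

With this toy grammar, parse("I saw a man in the house") produces, among other result edges, an edge labeled S spanning nodes 1 through 8. Because result edges are pushed back onto the stack, an edge such as S can itself serve as the left corner of the rule that combines a sentence with a prepositional phrase.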
[00021] Figure 4 is a table diagram showing the contents of a sample left corner condition (LCC) table, which may be used to rapidly identify which sub-grammar categories are applicable to a current word being analyzed. The LCC table is constructed and stored, for example in data storage such as a hard drive or other storage device, prior to analyzing input sentences. The LCC table has columns as follows: a sub-grammar name column 404 containing the names of the sub-grammar categories and a left corner condition column 406 containing a number of words, symbols and phrases (also referred to as "left corner conditions") associated with the sub-grammar category. The LCC table also contains a sub-grammar identification (GID) column, which contains a value uniquely identifying each row in the table and which can be used to quickly invoke the sub-grammar parser associated with the corresponding row (i.e., sub-grammar category). In use, the sub-grammar parser applicable to the current word can be determined by matching the current word being analyzed against the words (i.e., left corner conditions) listed in each row of the LCC table. It should be appreciated that the size of the LCC table has been severely constrained to provide a manageable example. The actual LCC table may contain any number of sub-grammar categories (e.g., tens or even hundreds) and each sub-grammar category may contain any number of left corner conditions (e.g., hundreds or even thousands).
[00022] For example, the noun phrase (NP) sub-grammar category includes words that are associated with a noun phrase. If the current word or phrase being analyzed is "the", the word "the" is looked up in the LCC table by comparing it with the words listed under the left corner condition column. Since the word "the" is associated with the noun phrase (NP) sub-grammar category, the sub-grammar parser associated with noun phrase (NP) will be selected to parse the word or phrase beginning with "the". It should be understood that certain words may reside in more than one sub-grammar category. For example, the word "copies" may be a noun in certain sentence contexts and a verb in other contexts. In cases where the current word is associated with more than one sub-grammar category, all corresponding sub-grammar parsers will be invoked to parse the current word. Consequently, more than one result edge may be produced during parsing of a particular word or phrase.
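By way of illustration only, the lookup described above may be sketched in Python as follows. The table contents, the GID values and the name applicable_sub_grammars are hypothetical and are not taken from figure 4; the sketch merely shows that a word such as "copies" can select more than one sub-grammar parser.

```python
# Hypothetical LCC rows: (GID, sub-grammar name, left corner conditions).
# The contents are assumptions for illustration, not the table of figure 4.
LCC_TABLE = [
    (1, "S", {"I", "he"}),
    (2, "NP", {"I", "he", "a", "the", "man", "house", "copies"}),
    (3, "VP", {"saw", "ran", "copies"}),
    (4, "PP", {"in", "with"}),
]


def applicable_sub_grammars(word):
    """Return (GID, name) for every row whose left corner conditions match."""
    return [(gid, name) for gid, name, corners in LCC_TABLE if word in corners]
```

In this sketch the word "the" selects only the NP row, whereas "copies" selects both the NP row and the VP row, so both corresponding sub-grammar parsers would be invoked.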
[00023] In one embodiment, to simplify the process of identifying sub-grammar categories that may be applicable to the current word being analyzed, each word, phrase or symbol recognizable by the multi-parser system is assigned a unique symbol identification number (SID). Figure 5 is a table diagram showing a symbol table. The symbol table contains a symbol column 502 and a symbol identification (SID) column, which contains a value uniquely identifying the word, phrase or symbol in each row and which can be used to quickly locate and retrieve information from the lookup table shown in figure 7. For example, in the sample symbol table shown in figure 5, the word "man" is arbitrarily assigned SID number "8".
[00024] As noted above, at least some of the sub-grammar categories listed in an actual LCC table will be associated with a large number of words, phrases and symbols. For example, the noun phrase (NP) sub-grammar category may be associated with all words found in a dictionary that are in some way associated with a noun phrase. Consequently, in a straightforward search, determining whether or not a particular word is associated with a particular sub-grammar category requires comparing the current word being analyzed against the entire list of words associated with each individual sub-grammar category. To avoid this, the present invention incorporates a search technique that reduces the amount of time and processing necessary to identify those sub-grammar categories listed in the LCC table that may be associated with the current word. In one embodiment, a hash table is utilized to expedite the process of searching through the left corner condition table.
[00025] Figure 6 is a table diagram showing a portion of a sample hash table, which may be used to determine an SID associated with a particular word, symbol or phrase. The illustrated hash table has an index column 602 and a symbol column 604.
Each entry (i.e., row) in the hash table includes a symbol (e.g., a symbol, a word or a phrase) which is indexed according to a hash value. The hash value associated with each symbol may be obtained by applying a hash function to the corresponding symbol. For example, the hash value of a symbol may be computed based on the American Standard Code for Information Interchange (ASCII) codes of the characters of the symbol. In this regard, since the ASCII codes for the letters of the word "man" are 109, 97 and 110, the hash value or index of the word "man" is 1866 [i.e., index = ((109*31+97)*31+110) mod 2000 = 1866]. The hash table also contains a symbol identification (SID) column, which contains a value uniquely identifying each symbol in the table and which is used to rapidly locate and examine the corresponding row in the lookup table of figure 7. In one embodiment, the SID specified in the hash table corresponds with the SID assigned to each symbol in the symbol table. In this regard, the hash table may be used to look up the SID associated with a particular word by first computing the hash value of the particular word and then using the hash value to index into the hash table in order to obtain the proper SID.
[00026] Figure 7 is a table diagram showing a lookup table constructed in accordance with one embodiment of the invention. By using the SID obtained from the hash table to index into the lookup table, the predictive cascading unit 102 is able to rapidly identify one or more sub-grammar categories applicable to a current word, phrase or symbol being analyzed. The lookup table is constructed and stored in data storage prior to analyzing input sentences. The illustrated lookup table has a symbol column 702 containing symbols (e.g., symbols, words or phrases) recognizable by the multi-parser system, a symbol identification (SID) column 704 and a number of sub-grammar parser columns 706-712. Each row in the lookup table is associated with a particular word that is indexed by the corresponding SID value. It should be appreciated that the size of the lookup table has been severely constrained to provide a manageable example. The actual lookup table may contain any number of symbols (e.g., thousands) and any number of sub-grammar parser columns (e.g., tens or hundreds).
[00027] The sub-grammar parser column 710 corresponds to a noun phrase (NP) sub-grammar parser and contains indications of whether the symbol associated with each row of the lookup table is applicable to the sub-grammar parser (NP).
Similarly, the columns 706, 708 and 712 correspond to a sentence (S) sub-grammar parser, a verb phrase (VP) sub-grammar parser and a prepositional phrase (PP) sub-grammar parser, respectively. For example, the eighth row of the lookup table has "man" in the symbol column and a "1" in the sub-grammar (NP) column 710, indicating that the sub-grammar parser (NP) is applicable to the symbol "man". In contrast, the eighth row has a "0" in the sub-grammar (VP) column 708, indicating that the sub-grammar parser (VP) is not applicable to the symbol "man". In use, an index value (e.g., SID) associated with a particular word may be derived by using a hash table or any other suitable indexing method. For example, the word "man" has an index value (SID) of "8" according to the sample hash table provided in figure 6. Then, by using this index value "8" to index into the lookup table, the word "man" can be easily located. Once a particular word has been located in the lookup table by using the proper index value, the predictive cascading unit interrogates the entire row by traversing from one column to the next to identify the one or more sub-grammar parser fields that contain a "1".
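By way of illustration only, the hash computation and the row traversal described above may be sketched in Python as follows. The multiplier 31 and the table size 2000 follow the worked example for "man", and the SID of 8 for "man" and the NP and VP flags in its row follow the description. Everything else is an assumption: the names hash_value, HASH_TABLE, LOOKUP and applicable_parsers are hypothetical, the SID of 9 for "house" is inferred from the reference to the ninth row, the S and PP flags are assumed to be 0, and collision handling is not addressed.

```python
TABLE_SIZE = 2000  # modulus used in the worked example for "man"


def hash_value(symbol):
    """Fold the character codes of the symbol together using multiplier 31."""
    h = 0
    for ch in symbol:
        h = h * 31 + ord(ch)
    return h % TABLE_SIZE


# Symbol table: "man" -> 8 is from the description; "house" -> 9 is assumed.
SYMBOLS = {"man": 8, "house": 9}

# Hash table: index -> (symbol, SID). Assumption: no two symbols collide.
HASH_TABLE = {hash_value(s): (s, sid) for s, sid in SYMBOLS.items()}

# Lookup table rows keyed by SID; one flag per sub-grammar parser column.
# The S and PP flags below are assumptions for illustration.
COLUMNS = ("S", "VP", "NP", "PP")
LOOKUP = {
    8: (0, 0, 1, 0),  # "man": NP applicable, VP not applicable
    9: (0, 0, 1, 0),  # "house": NP applicable
}


def applicable_parsers(word):
    """Stage 1: hash the word to find its SID. Stage 2: scan the SID's row."""
    entry = HASH_TABLE.get(hash_value(word))
    if entry is None or entry[0] != word:
        return []  # the word is not a recognized symbol
    row = LOOKUP[entry[1]]
    return [name for name, flag in zip(COLUMNS, row) if flag == 1]
```

In this sketch hash_value("man") evaluates to 1866, in agreement with the worked example, and applicable_parsers("man") returns only the NP parser.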
[00028] In this illustrated embodiment, the sub-grammar searching process consists of two major stages. The first stage involves identifying an index value associated with the current word being analyzed. Then, in the second stage, by indexing into the lookup table using the index value derived, the sub-grammar categories applicable to the current word may be rapidly identified. Once the sub-grammar categories have been identified, the multi-parser system invokes only those sub-grammar parsers that correspond to the sub-grammar categories identified as being applicable.
[00029] In an alternative embodiment, the determination of the applicable sub-grammar parsers may be further expedited by indexing directly to the lookup table using a hash value or any other suitable indexing method. Accordingly, in this alternative embodiment, the symbols specified within the symbol column 702 of the lookup table shown in figure 7 will be rearranged such that they are indexed based on the hash value or other suitable indexing value.
[00030] Advantageously, by using the left corner condition table shown in figure 4 or the lookup table shown in figure 7, the multi-parser system of the present invention is able to significantly expedite the process of determining which grammar rules may be applicable to a particular word or phrase. According to one embodiment of the invention, the multi-parser system is able to rapidly eliminate the majority of sub-grammar categories that are not applicable to a current word being analyzed, without having to go through the entire set of rules listed within individual sub-grammar categories. This means that the multi-parser system is able to focus its analysis on only those sub-grammar categories that have been identified as being relevant to the parsing of the current word, without wasting time searching through all sub-grammar categories.
[00031] The parsing operations of the present invention will be described with reference to a lattice diagram shown in figure 8. In the example shown in figure 8, an input sentence "I saw a man in the house" is provided. The lattice diagram includes nodes (N1 through N8) and edges (L1 through L17) between the nodes. Each edge represents a portion of the input sentence. The input edges L1 through L7 correspond to the individual words in the input sentence. The edges L8 through L17 correspond to the result edges generated by the sub-grammar parsers. Each edge corresponds to a word or a portion of the input sentence and is used to indicate the part of speech, the grammatical structure or the phrase structure of the corresponding portion thereof.
[00032] In the illustrated embodiment, a cascading composition algorithm is employed in which the underlying syntactic structure of a sentence is determined from the bottom up. In this embodiment, individual words of the sentence are initially analyzed from right to left, and the analysis subsequently proceeds to intermediate phrases until the analysis of the sentence is complete.
[00033] Initially, after the input sentence has been received, the individual input edges L1 through L7 are placed in a stack in reverse topological order such that the word "house" is at the top of the stack. Accordingly, the word "house" is popped from the stack and designated as the current word being processed. Then, the predictive cascading unit 102 identifies which sub-grammar categories are associated with the current word. First, the hash value associated with the current word "house" is computed. Then, using the hash value to index into the hash table shown in figure 6, the predictive cascading unit 102 is able to determine the SID associated with the current word "house". Then, by indexing into the lookup table (figure 7) using the SID derived, the predictive cascading unit 102 is able to identify which sub-grammar categories may be applicable by traversing across the ninth row; a zero "0" means that the particular sub-grammar category is not applicable and a one "1" means that the particular sub-grammar category may be applicable. According to the lookup table, the word "house" is associated with the sub-grammar parser (NP). Once the sub-grammar parser (NP) is invoked to parse the word "house", a result edge L8 is derived. The edge L8 extending from node N7 to node N8 is labeled as a noun phrase (NP) because the word "house" met the requirement of one of the rules set forth in the sub-grammar parser (NP).
[00034] The predictive cascading unit 102 proceeds to parse the next word in the sentence. In this example, a noun phrase (NP) sub-grammar parser will be invoked at the next location (input edge L6) in the input sentence. Once the sub-grammar parser (NP) is invoked, it will determine whether the sequence of words "the house" is a valid noun phrase. Since the phrase "the house" is one of the valid noun phrases listed in the sub-grammar noun phrase (NP) shown in figure 3, it returns a result edge L9 labeled as a noun phrase to indicate that the determiner "the" followed by the noun "house" has satisfied one of the rules specified in the sub-grammar (NP). Once the result edge L9, extending from node N6 to node N8 and labeled as a noun phrase (NP), has been returned by the sub-grammar parser (NP), the lattice table 104 (shown in figure 9) and the stack 106 maintained by the predictive cascading unit 102 are updated with the new result edge. This parsing process is repeated until all entries in the stack have been processed. At the end, a single edge covering the entire sentence is derived. In this example, the edge L17 covering the entire input sentence from node N1 to node N8 is derived at the end by the sub-grammar parser (S) and is labeled as a sentence (S) because the sentence (S) (the result edge L16) followed by a prepositional phrase (PP) (the result edge L10) meets the requirement of one of the rules set forth in the sub-grammar (S) shown in figure 3.
[00035] Figure 9 depicts a lattice table containing a list of edges, which correspond to the edges shown in the lattice diagram of figure 8. In the illustrated lattice table, each edge is associated with an edge name (e.g., NP, PP) and a set of nodes (i.e., a start node and an end node) defining the boundaries of the edge.
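By way of illustration only, a lattice table of this kind, together with the best-path selection of the post-processing stage, may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names ScoredEdge and best_path are hypothetical, the probability values are invented solely for the example and do not come from the figures, only three result edges are shown, and the selection is a Viterbi-style dynamic program over nodes numbered in sentence order rather than the specific procedure of the Viterbi unit 112.

```python
import math
from collections import namedtuple

# One lattice table row: edge name, start node, end node, and a score.
ScoredEdge = namedtuple("ScoredEdge", "name start end prob")


def best_path(edges, first_node, last_node):
    """Return (log-probability, edges) of the best path, or None if no path."""
    # best[n] holds the best score and path found from first_node to node n.
    best = {first_node: (0.0, [])}
    # Edges always run from a lower node to a higher node, so visiting nodes
    # in increasing order guarantees best[node] is final when it is expanded.
    for node in range(first_node, last_node):
        if node not in best:
            continue
        score, path = best[node]
        for edge in edges:
            if edge.start != node:
                continue
            candidate = score + math.log(edge.prob)
            if edge.end not in best or candidate > best[edge.end][0]:
                best[edge.end] = (candidate, path + [edge])
    return best.get(last_node)


# Hypothetical scores for three result edges of the example sentence.
LATTICE = [
    ScoredEdge("S", 1, 5, 0.6),
    ScoredEdge("PP", 5, 8, 0.7),
    ScoredEdge("S", 1, 8, 0.5),
]
```

With these invented scores, the single edge S spanning nodes 1 through 8 (probability 0.5) is preferred over the two-edge path formed by S from node 1 to node 5 followed by PP from node 5 to node 8 (combined probability 0.42).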
[00036] Examples of sub-grammars are shown in figure 3. Each sub-grammar contains a set of grammar rules. The rules contained in the sub-grammars enable the predictive cascading unit to determine whether the current word and subsequent words or phrases can be combined into a phrase, such as a noun phrase, a prepositional phrase, a verb phrase or a sentence. It should be understood that the number of rules associated with each sub-grammar shown in figure 3 has been severely constrained to provide a manageable example and that the actual sub-grammars may contain any number of rules.
[00037] The grammar of the English language may be partitioned into any number of sub-grammars (e.g., tens or even hundreds). Each sub-grammar parser associated with a respective sub-grammar is configured to extract a certain language feature. For example, one sub-grammar parser may be configured to extract portions of the input sentence that describe time, and another sub-grammar parser may be configured to extract portions that describe dates. For example, given the sentence "I went to California last year", the sub-grammar parser for time will extract "last year" and identify that portion of the sentence as time. Examples of other sub-grammar parsers include those for location, age, money amount, etc.
[00038] As seen by referring to figure 3, the sub-grammar sentence (S) contains two rules. The first rule in the sub-grammar (S) indicates that a sentence can be composed of a noun phrase (NP) followed by a verb phrase (VP). The second rule in the sub-grammar (S) indicates that a sentence (S) can also be formed by combining a sentence (S) with a prepositional phrase (PP). Next, the sub-grammar verb phrase (VP) specifies that a verb phrase may be composed of a verb such as "saw" or "ran" followed by a noun phrase (NP) or a prepositional phrase (PP). Similarly, the sub-grammar noun phrase (NP) specifies that a noun phrase may be composed of a single word such as "I", "he", "man", "house", etc., or may be composed of a combination of words such as a determiner "a" or "the" followed by a noun such as "man", "house" or "telescope". Finally, the sub-grammar prepositional phrase (PP) specifies that a prepositional phrase (PP) may be composed of a preposition such as "in" or "with" followed by a noun phrase (NP).
[00039] The operations performed by the present invention may be embodied in the form of a software program stored on a machine-readable medium, such as, but not limited to, any type of disk including floppy disks, hard disks, optical discs, CD-ROMs and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus. Moreover, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
[00040] The multi-parser system constructed according to the present invention may provide a significant advantage over conventional parsers. At least some conventional parsers compare all individual rules defining the entire grammar with a portion of a sentence being analyzed to determine which rules are applicable. Consequently, the conventional parsing process can be time consuming, since a grammar typically consists of a very large number of rules and the conventional parser must search through the entire list of rules each time a different portion of the sentence is analyzed, until the whole sentence is parsed. According to one embodiment of the invention, by constructing a left corner condition table that specifies the left corner conditions for individual sub-grammar parsers prior to analyzing input sentences, and by using that left corner condition table to invoke the applicable sub-grammar parsers at each position along the input sentence, the amount of time required to perform parsing operations is significantly reduced. As a result, the present invention improves the performance of a multi-parser architecture, which is an important component of various natural language understanding applications, including extraction, summarization and robust understanding systems. Another advantage achievable by the present invention is the ability to easily incorporate a large number of new sub-grammar parsers into the multi-parser architecture system.
[00041] While the foregoing embodiments of the invention have been described and shown, it is understood that variations and modifications, such as those suggested and others within the spirit and scope of the invention, may occur to those skilled in the art to which the invention pertains. For example, it should be noted that the sub-grammar category searching technique taught by the present invention can be applied to the parsing of any natural language and is not limited to the English language. The scope of the present invention accordingly is to be defined as set forth in the appended claims.