CN106663094B - Method and system for linear generalized LL recognition and context-aware parsing - Google Patents

Info

Publication number
CN106663094B
Authority
CN
China
Prior art keywords
list
path list
processing
grammar
code
Prior art date
Legal status
Active
Application number
CN201580037492.4A
Other languages
Chinese (zh)
Other versions
CN106663094A (en)
Inventor
Loring G. Craymer III (洛林·G·克雷默三世)
Current Assignee
Loring G. Craymer III
Original Assignee
Loring G. Craymer III
Priority date
Filing date
Publication date
Application filed by Loring G. Craymer III
Priority to CN202010289622.6A (published as CN111522554A)
Priority claimed from PCT/US2015/040049 (WO2016007923A1)
Publication of CN106663094A
Application granted
Publication of CN106663094B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

A computer system and method of grammar analysis that generate code for runtime recognition, producing a list of directions, or a graphical representation of multiple lists of directions, to be followed for a given sentence during subsequent parsing. The computer system implements the method to: parse the grammar to create an intermediate representation; construct a graph for analysis representing all features of the grammar, including recursion, alternation, grouping of alternatives, and loops; process each decision point in the graph to generate the intermediate representation; generate code for a recognition function that returns a list of directions used in runtime parsing decisions; and patch each decision point token to reference or inline the top-level recognition code for each decision point.

Description

Method and system for linear generalized LL recognition and context-aware parsing
Cross Reference to Related Applications
This application claims priority from U.S. application No. 14/796,782, filed on 10 July 2015, and U.S. provisional application No. 62/023,771, filed on 11 July 2014, both of which are incorporated by reference in their entirety.
Technical Field
Embodiments of the present invention as shown and described herein relate to the parsing of symbol strings in a formal language according to the rules of a formal grammar. In particular, embodiments provide an improved generalized LL (Left-to-right, Leftmost derivation) parsing process with O(n) performance in the absence of ambiguity.
Background
Parsing (also called syntax analysis) is the process of analyzing a set of symbols, which may be a string or a similar format, where a "string" is a sequence of items (in this case symbols), the sequence is finite, and the symbols are selected from a set of possible symbols called an alphabet. The parsing process is applicable to natural languages, computer languages, and similar systems, including DNA sequences. The parsing process applies a set of rules that are specific to the formal grammar of the language being processed. The parsing process is a computer-implemented process, and the term is used in the sense understood in the field of computer science and, more specifically, in the field of computational linguistics.
In computational linguistics, parsing is also understood to refer to the formal analysis of sentences or other word strings in natural or computer languages by computer processors and programs to yield their components and to derive parse trees that show the grammatical relationships between each component and each other component. The parse tree may also contain semantic information and other relevant information about the sentence or word string being processed.
In some applications in computer science, the parsing process is used for the analysis of a computer language and involves parsing input code into its constituent parts to facilitate the subsequent functioning of a compiler and/or interpreter, which converts code written in one computer language into an executable form (i.e., a computer language that a computer processor is capable of executing).
Disclosure of Invention
A computer system and method of grammar analysis that generate code for runtime recognition, producing a list of directions, or a graphical representation of multiple lists of directions, to be followed for a given sentence during subsequent parsing. The computer system implements the method to: parse the grammar to create an intermediate representation; construct a graph for analysis representing all features of the grammar, including recursion, alternation, grouping of alternatives, and loops; process each decision point in the graph to generate the intermediate representation; generate code for a recognition function that returns a list of directions used in runtime parsing decisions; and patch each decision point token to reference or inline the top-level recognition code for each decision point.
Drawings
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements. It should be noted that references to "an" or "one" embodiment of the invention in this disclosure are not necessarily to the same embodiment, and they mean at least one.
FIG. 1 is a diagram of one embodiment of a graphical representation of a grammar.
FIG. 2 is a diagram of one embodiment of a graphical representation of a grammar.
FIGS. 3A to 3D are diagrams of one embodiment of the derivation of the recursive subgraph of FIG. 1.
FIG. 4 is a flow diagram of a process for code generation.
FIG. 5 is a flow diagram of one embodiment of a decision point analysis process.
FIG. 6 is a flow diagram of one embodiment of a decision point analysis list process.
Figs. 7, 8 and 9 show the simplest version of the runtime recognition process (using the restricted GOTO model).
Fig. 7 is a flow chart of the overall framework.
Fig. 8 is a flowchart of the process of adding to the direction list.
Fig. 9 is a flowchart of the process of constructing a direction graph.
FIG. 10 is a diagram of one embodiment of a parsing system.
FIG. 11 is a diagram of one embodiment of a compiler and linker system.
FIG. 12 is a diagram of one embodiment of a computer system to implement a parsing process.
Detailed Description
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
Operations described in the flowcharts in the figures will be described with reference to the exemplary embodiments shown in the figures. However, it should be understood that the operations described in the flowcharts may be performed by embodiments other than the embodiments of the present invention discussed with reference to the figures, and that the embodiments discussed with reference to the diagrams in the figures may perform operations different from those discussed with reference to the flowcharts in the figures.
The techniques illustrated in the figures can be implemented using code and data stored and executed on one or more electronic devices (e.g., an end station, a network element, etc.). Such electronic devices store and communicate (inter-communicate and/or communicate with other electronic devices via a network) code and data using non-transitory machine-readable or computer-readable media, such as non-transitory machine-readable or computer-readable storage media (e.g., magnetic disks, optical disks, random access memories, read-only memories, flash memories, and phase-change memories) and transitory computer-readable transmission media (e.g., electrical, optical, acoustical or other forms of propagated signals, such as carrier waves, infrared signals, digital signals). Further, such electronic devices typically include a set of one or more processors coupled to one or more other components, such as one or more memory devices, user input/output devices (e.g., a keyboard, a touch screen, and/or a display), and network connections. As used herein, "group" refers to any positive integer number of items. The coupling of the set of processors to other components is typically accomplished through one or more buses and bridges (also referred to as bus controllers). Storage means represent one or more non-transitory machine-readable or computer-readable storage media and non-transitory machine-readable or computer-readable communication media. Thus, the memory device of a given electronic device typically stores code and/or data for execution on the set of one or more processors of the electronic device. Of course, one or more portions of embodiments of the invention may be implemented using different combinations of software, firmware, and/or hardware.
As used herein, a network element (e.g., a router, switch, bridge, etc.) is a piece of networked equipment, including hardware and software, that communicatively connects other equipment on a network (e.g., other network elements, end stations, etc.) to one another. Some network elements are "multi-service network elements" that provide support for multiple network functions (e.g., routing, bridging, switching, two-layer aggregation, session border control, multicasting, and/or subscriber management) and/or support for multiple application services (e.g., data, voice, and video).
Overview
Generalized parsing methods in the prior art have O(n³) performance. Embodiments herein describe a method and system for a parsing process with O(n) performance in the absence of ambiguity (i.e., the existence of multiple alternative interpretations for a given "character" sequence; in natural language these are referred to as "puns"). A parser created according to the principles and techniques described herein operates by first performing a recognition phase, which generates a sequence of directions for navigating decision points, and subsequently, or interspersed with the recognition phase, performing a parsing phase, which follows the directions while performing all of the actions that may be used by a compiler, DNA sequence analyzer, or similar component that implements the parsing process to solve problems that cannot be handled by a finite state automaton.
To generate the recognizer and parser that implement this process, the grammar is represented in two alternative forms: a graph with directed edges connecting vertices that represent token types to be matched, and an Abstract Syntax Tree (AST). The graph is used to identify decision points (vertices with multiple outgoing edges) and is processed to generate recognizer code. The AST representation is used to generate parsing code that includes calls to the recognizer routines. In general, an AST representation is generated from the grammar described by its productions, then the elements for constructing the graph are derived from the AST forms production by production, and then the graph is constructed. During graph construction, the AST form is elaborated and some productions are duplicated to ensure that decisions are uniquely named; these productions are added to the original AST, and the elaborated AST is used to generate the parsing code.
Introduction
Formal languages are described by grammars consisting of "rules" or "productions" that define the full set of legal sequences of "characters" that make up sentences in the language. A context-free language is a formal language that may be described by a context-free grammar (CFG), where a context-free grammar G is defined as
G=(N,T,P,S)
where N is a set of non-terminal symbols, T is a set of terminal symbols (characters) allowed by the language, and P is a set of productions (or rewrite rules) of the form:
n -> <sequence of terminals and non-terminals>,
where n is a non-terminal;
and S is the start symbol (another non-terminal symbol).
For language processing, the CFG specification is augmented with the alternative operator "|" so that there is a single production for each n. This is known as Backus-Naur Form (BNF), which is often Extended (EBNF) to include grouping of phrases and repetition, with A? indicating 0 or 1 occurrences. As used herein, non-terminals start with lower-case letters and terminals start with upper-case letters; a production is written as:
a : <sequence of symbols> ;
in fact, a more general formal language is defined in an extended CFG to support "actions" described in a general programming language (including semantic predicates). Semantic predicates are used to query context information and invalidate alternatives when the predicate evaluates to false.
Graphical representation
The augmented CFG can be represented in the form of a graph with directed edges connecting vertices ("nodes"), where each vertex is occupied by a single terminal, action, or semantic predicate. Since a CFG supports recursion, two additional node types are needed to save and restore return-context information; logically, these node types PUSH values onto and POP values from a stack, where the POP node determines which node follows based on the value retrieved from the stack. To understand this, consider the recursive production:
a:A a B
|B a C
|D;
in the first alternative, the return path must involve identifying B after identifying a and recursively invoking a; whereas the return path after identifying B in the second alternative and recursively invoking a must involve identifying C. In graphical form, before a recursive call (loop back to the first call of a), a PUSH _ CONTEXT node must be inserted; a POP _ CONTEXT node is inserted at the end of the production (where the semicolon appears). For processing convenience, a "(" and ")" node representing a possible decision point (as with the POP _ CONTEXT node) is also supported.
If a start production of the following form is added:
s:E a F;
the resulting grammar can be represented as a graph as shown in fig. 1.
In fig. 1, the forward arrows are black and the return loops are shown in bold. The dashed arrows represent connections that may or may not exist after construction, but they are not followed during code generation. There are decision points in fig. 1: like the "(" node, the POP node has multiple possible following nodes, but its choice of direction is determined by the value on the stack rather than by a match of the input characters. The decision as to which path to take from the "(" node is determined by the next token ("character") taken from the input stream; this is called an LA(1) decision because it requires only one look-ahead (LA) token.
Some decisions (such as those shown in fig. 2) may require more than one look-ahead token.
In fig. 2 there are two decision nodes: the BLOCK node and the EOB node. These nodes are present to indicate structure rather than values to be matched, so the BLOCK decision is between the sequences ABC and ABD, and the EOB decision is between ABC, ABD and ABE. Both require three look-ahead tokens (the third token distinguishes C from D and E) to determine which path leading from the decision to take.
Depending on the sequence of tokens to be matched, a complex decision may include multiple decision points before a suitable path can be distinguished.
Decision analysis during grammar graph processing
The basic method for analyzing decision points is as follows: a list of alternative paths is constructed from the decision point, and the next token type along all paths is compared. A singleton token type indicates a "dropout": if such a token type matches at runtime, the path on which it was found is the valid one, the corresponding path index may be returned, and the corresponding path may be deleted from the list. If the remaining paths all represent the same token type, code is generated to match the token type and the remaining paths are advanced to the next token. Otherwise, the list is split by token type and processing continues for each resulting list. For each new list, a function is generated along with code to call each new function, and processing of the previous path list terminates; the body of the new function will be generated when the list generated at the decision point is processed. Difficulties arise when secondary decision points or POP nodes are encountered; these situations are described in more detail below. Duplicate list processing is avoided by using a list handling table that is checked after a new list is constructed; if an equivalent list is found in the table, the reference to the newly created list is replaced with a reference to the list found in the table, and the new list is discarded without further processing. This prevents the analysis process from entering an infinite loop.
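As a rough illustration of this analysis loop, the following Java sketch follows the steps above; all class and method names (AltPath, GraphNode, processList, and the emitted code strings) are illustrative assumptions rather than the patent's actual implementation, and path advancement is simplified to a single successor.

import java.util.*;

// Hypothetical sketch of decision-point analysis by list splitting (names assumed).
class DecisionAnalyzer {
    static class GraphNode { String tokenType; List<GraphNode> next = new ArrayList<>(); }
    static class AltPath {                          // one alternative leading from the decision point
        final int altIndex; GraphNode current;
        AltPath(int altIndex, GraphNode current) { this.altIndex = altIndex; this.current = current; }
    }

    private final Deque<List<AltPath>> workQueue = new ArrayDeque<>();
    private final Map<List<GraphNode>, List<AltPath>> listHandlingTable = new HashMap<>();
    final StringBuilder emitted = new StringBuilder();    // stands in for generated code / IR

    void analyze(GraphNode decisionPoint) {
        List<AltPath> alts = new ArrayList<>();
        for (int i = 0; i < decisionPoint.next.size(); i++)
            alts.add(new AltPath(i, decisionPoint.next.get(i)));
        workQueue.add(alts);
        while (!workQueue.isEmpty()) processList(workQueue.remove());
    }

    private void processList(List<AltPath> list) {
        List<GraphNode> key = new ArrayList<>();           // equivalent lists are processed only once,
        for (AltPath p : list) key.add(p.current);         // which prevents infinite loops
        if (listHandlingTable.putIfAbsent(key, list) != null) return;

        Map<String, List<AltPath>> byToken = new LinkedHashMap<>();
        for (AltPath p : list)
            byToken.computeIfAbsent(p.current.tokenType, t -> new ArrayList<>()).add(p);

        // Singleton token types are dropouts: matching one identifies a single path.
        byToken.entrySet().removeIf(e -> {
            if (e.getValue().size() != 1) return false;
            emitted.append("case ").append(e.getKey())
                   .append(": return ").append(e.getValue().get(0).altIndex).append(";\n");
            return true;
        });

        if (byToken.size() == 1) {                         // all remaining paths agree on the next token
            String token = byToken.keySet().iterator().next();
            emitted.append("match(").append(token).append(");\n");
            List<AltPath> remaining = byToken.get(token);
            for (AltPath p : remaining) p.current = p.current.next.get(0);   // simplified advance
            workQueue.add(remaining);
        } else {                                           // otherwise split by token type
            workQueue.addAll(byToken.values());
        }
    }
}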
In FIG. 2, the analysis of the first decision (the "(" at the left side of the graph) proceeds as follows. First, a list of alternative paths is constructed that point to each A in the graph. Code to match A is generated and the paths are advanced to each B node. Both match, so code to match B is produced.
Decision analysis at runtime
At runtime, the recognition (look-ahead) phase may run to completion before parsing, or it may run interspersed, with the recognition code invoked at selected decision points as they are encountered during parsing. The difference between these modes of operation is that run-to-completion requires only the recognition code for the first decision point encountered and returns no parse directions until the input is exhausted, while interspersed recognition returns whenever the set of alternative paths has been reduced to a single path. This discussion assumes the latter approach; in either case, however, the parse directions consist of a list (or a multi-column table or graph, discussed further below) of indices that select the paths leading from decision points. As an example, the list {3,2,1} directs parsing to take alternative 3 at the first decision point, alternative 2 at the second decision point, and alternative 1 at the third decision point. A fourth decision would require another recognition call.
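A minimal sketch of how a parser might consume such a direction list is shown below; the names (DirectionGuidedParse, nextDirection, recognize) are assumptions made for illustration only.

import java.util.*;

// Hypothetical sketch: consuming a direction list such as {3, 2, 1} at decision points.
class DirectionGuidedParse {
    private final Deque<Integer> directions = new ArrayDeque<>(List.of(3, 2, 1));

    // Called at each decision point during parsing.
    int nextDirection() {
        if (directions.isEmpty()) {
            return recognize();         // a fourth decision requires another recognition call
        }
        return directions.remove();     // 3 at the first decision, 2 at the second, 1 at the third
    }

    private int recognize() { return -1; }   // stub standing in for the generated recognition function
}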
Special token types
In one example implementation described herein, most of the token types referenced in the grammar represent tokens to be matched at runtime, but some represent decisions or required processing actions and are present in both the AST and graph representations.
They include:
BLOCK: start of a grouping of alternatives (a decision point when there are alternatives).
EOB: end of a grouping (decision node for loops).
CLOSE: used for (...)* decision points, which are denoted (...)+ in the figures.
POP_CONTEXT: as described above.
PUSH_CONTEXT: as described above.
SEMANTIC_PREDICATE: predicate decision points. This is a decision node in the graph representation, but not in the AST, where it represents an action (code executed at runtime).
SYNTACTIC_PREDICATE: starts a forced look-ahead; the alternative following SYNTACTIC_PREDICATE is only valid if the predicate matches in its entirety.
END_SYNTACTIC_PREDICATE: ends the look-ahead.
For the BLOCK, EOB and CLOSE tokens, it is important that the graph representation use the same tokens used in the AST representation, so that edits made in one representation can be read in the other. These tokens are embedded in "carrier" nodes in the AST, in the graph representation, or in both.
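A minimal sketch of one way these special token types and carrier nodes might be modeled is given below; this is an illustrative assumption, not the patent's actual data model.

// Hypothetical sketch of the special token types listed above and a carrier node.
enum SpecialTokenType {
    BLOCK,                    // start of an alternative grouping (decision point)
    EOB,                      // end of grouping (loop decision node)
    CLOSE,                    // (...)* / (...)+ loop decision point
    PUSH_CONTEXT,             // save return context for a recursive production
    POP_CONTEXT,              // restore return context; the popped value selects the successor
    SEMANTIC_PREDICATE,       // decision node in the graph, action in the AST
    SYNTACTIC_PREDICATE,      // start of a forced look-ahead
    END_SYNTACTIC_PREDICATE   // end of the forced look-ahead
}

// The same carrier node (and thus the same token) can appear in both the AST and the graph.
class CarrierNode {
    SpecialTokenType special;   // null for ordinary grammar token types
    String grammarToken;        // terminal symbol from the source grammar, if any
}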
AST-to-graph conversion
The basic approach taken is to traverse the AST for the start production, thereby generating a token sequence. These tokens are then processed to construct the grammar graph: except for recursive productions, which require special processing, non-terminal tokens are expanded inline. The expansion consists of traversing the AST for the referenced production (on first encounter), or (on subsequent encounters) copying the referenced AST and traversing the copy; these traversal steps generate a sequence of tokens that are processed in order and also expand non-terminals as they are encountered. When copying, the copied production is renamed, as are the non-terminal references to it. Recursive productions are an exception to inline expansion; they are dynamically converted into loops.
Handling recursion
To track recursion, a stack of the non-terminal names nested by rule expansion is maintained during graph construction; before adding another level of nesting, the non-terminal is checked against the contents of the stack. If it is found, a PUSH_CONTEXT node is added to the graph and to an array of context values for the recursive production. The node contains a context index value set to its offset into the array. When the expansion of the recursive production is complete, the array is used to create loops from each PUSH_CONTEXT node back to the beginning of the production in the graph; a POP_CONTEXT node is attached to the graph and looped back to the node following each PUSH_CONTEXT node in the graph. This is illustrated in figs. 3A to 3D, which follow the derivation of the recursive subgraph of fig. 1.
In traversing the AST to generate FIGS. 3A-3D from the "a" production given above, first an initial "(" is added to the graph, and then the A node is added, after which the recursive reference is encountered. At this point an array is created to hold pointers to PUSH_CONTEXT nodes, a PUSH_CONTEXT node with index 0 is added to the graph, and the 0th entry of the array is made to point to that PUSH_CONTEXT node; this produces FIG. 3A. Then B is appended to the path, and processing continues with the next alternative. A second B is added, and the second recursive call to a is encountered, adding a PUSH_CONTEXT 1 node and updating the array so that the 1st entry points to that node; FIG. 3B shows the intermediate graph. Then C is added to complete the path. Construction then proceeds to the third path, and D is added. The paths are then combined at the decision node ")" to reach the end of the production. Because an array was constructed for this production, a POP_CONTEXT node is added; this is shown in FIG. 3C. The array is then used to generate the loops shown in FIG. 3D: first, a loop is constructed from each PUSH_CONTEXT node to the original decision node, and then, for each array entry, a loop from the POP_CONTEXT node is added to complete the graph.
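The following Java sketch compresses this construction into code form; the names (RecursionTracker, expandReference, closeRecursion) and the single-successor simplification are assumptions made for illustration, not the patent's implementation.

import java.util.*;

// Hypothetical sketch of recursion handling during graph construction.
class RecursionTracker {
    static class GraphNode { String label; int contextIndex = -1; List<GraphNode> next = new ArrayList<>(); }

    private final Deque<String> ruleStack = new ArrayDeque<>();            // nesting of non-terminal names
    private final Map<String, List<GraphNode>> contextArrays = new HashMap<>();

    // Called when a non-terminal reference is encountered while building the graph.
    GraphNode expandReference(String nonTerminal) {
        if (ruleStack.contains(nonTerminal)) {                             // recursive reference
            List<GraphNode> array = contextArrays.computeIfAbsent(nonTerminal, n -> new ArrayList<>());
            GraphNode push = new GraphNode();
            push.label = "PUSH_CONTEXT";
            push.contextIndex = array.size();                              // offset into the context array
            array.add(push);
            return push;                                                   // loop edges are added later
        }
        ruleStack.push(nonTerminal);
        GraphNode expanded = expandProductionInline(nonTerminal);          // normal inline expansion (stub)
        ruleStack.pop();
        return expanded;
    }

    // Called once expansion of a recursive production is complete.
    void closeRecursion(String nonTerminal, GraphNode productionStart, GraphNode productionEnd) {
        List<GraphNode> array = contextArrays.get(nonTerminal);
        if (array == null) return;                                         // not recursive: nothing to do
        GraphNode pop = new GraphNode();
        pop.label = "POP_CONTEXT";
        productionEnd.next.add(pop);                                       // attach POP_CONTEXT at the end
        for (GraphNode push : array) {
            GraphNode returnTarget = push.next.get(0);                     // node already following PUSH_CONTEXT
            pop.next.add(returnTarget);                                    // POP_CONTEXT loops back to it
            push.next.add(productionStart);                                // PUSH_CONTEXT loops to the start
        }
    }

    private GraphNode expandProductionInline(String nonTerminal) { return new GraphNode(); }  // stub
}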
Decision processing
Each decision node in the graph is processed to generate recognition code; decision nodes may be collected during graph construction or through a systematic traversal of the graph. The processing consists of building and traversing a list of alternatives, where each step involves comparing the values of the various alternatives in the list with each other; if all nodes have a matching token type, matching code (or an intermediate representation) is generated for that token type and each alternative is advanced to its next node. When an alternative reaches a token type that does not match the other alternatives, ending code is generated for that alternative and list processing continues with the remaining alternatives. Nodes containing predicates, secondary decisions, and PUSH_CONTEXT or POP_CONTEXT nodes force special processing; when the list contains multiple token types after deletion of the singleton types, the list is split into one list per token type, and each list is processed separately.
Handling ambiguities
Ambiguity arises when alternative paths from a decision point merge (share a common node) during the decision process. In particular, in this case the two paths are described as syntactically ambiguous; that is, there are token sequences that can be interpreted in two different ways. Generalized parsing involves advancing all alternatives. Another approach, the parsing expression grammar (PEG) method, arbitrarily selects one alternative and discards the others. PEG-type disambiguation is useful for formal languages; formal languages tend to have limited syntactic ambiguity and no semantic ambiguity.
With PEG-type disambiguation there is always only one valid parse, and the parse can be guided by the list of directions. For generalized parsing, the recognition stage generates a directed graph rather than a list, with the graph representing multiple lists, each of which may be used to guide parsing.
For generalized parsing, there are multiple valid parses, so the direction list is replaced with a directed graph representing all valid direction lists in a compact form.
Code generation model - Java or other languages lacking unrestricted GOTO
When each list/graph is created, two functions are created for it: the main function holds the code that performs the token processing, while an auxiliary function is called when exiting the main function. The exit function builds the list/graph of alternative indices. Each alternative in the list is represented by a data structure containing a decision index and a current graph node. Code to return an index value is generated when the representation of an alternative is added to the list (copied from the previous list or reconstructed from the alternative at the decision node); if the code is for a newly created representation, it includes adding the alternative's index to the index list. As the decision process progresses, code for token matching is added to the list's main function, as well as code for special processing (semantic predicates, a "switch" statement when the list is split, an if...else...then statement for processing semantic predicates, a push statement for PUSH_CONTEXT, and a pop statement followed by a switch statement for processing POP_CONTEXT). For each recursive production there is a named stack; pushes and pops are done on the stack appropriate for the context in which they appear.
Loop structures generate recursive function calls; to support this recursion, all lists that have the same set of current nodes (and the same arrangement of current nodes in the list) must be treated as equivalent and mapped to a single function in the generated code. This also helps minimize the generated code by avoiding multiple functions with the same body. When a path list is created, it is checked against the list handling table; if the path list is found in the table, the reference is replaced with a reference to the previously encountered list, and if an equivalent is not found in the table, a new entry is added to the table.
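One way this equivalence mapping might look in code is sketched below; the key construction and naming scheme are assumptions made for illustration only.

import java.util.*;

// Hypothetical sketch: path lists with the same ordered set of current nodes
// map to a single generated function.
class FunctionNamer {
    static class GraphNode { String tokenType; }
    static class AltPath { int altIndex; GraphNode current; }

    private final Map<List<GraphNode>, String> functionByNodeKey = new HashMap<>();

    String functionFor(List<AltPath> pathList) {
        List<GraphNode> key = new ArrayList<>();
        for (AltPath p : pathList) key.add(p.current);      // the ordered current nodes form the key
        String name = functionByNodeKey.get(key);
        if (name == null) {
            name = "list_" + functionByNodeKey.size();       // new function for a new equivalence class
            functionByNodeKey.put(key, name);
        }
        return name;                                         // equivalent lists share one function
    }
}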
Code generation model - languages with unrestricted GOTO
An unfortunate feature of the code generation model described above is that it can result in a very deep function call stack. However, in each generated function only one local variable, "index", is required and defined, and thus the call to the main function can be replaced with a GOTO. This leaves the problem of the return call to the exit function. These calls can be handled by "publishing" the exit function's address to a list of return function addresses in the main function immediately prior to the GOTO. The list may be processed as follows (C/C++ code):
while (offset < listSize) {
    index = (*list[offset++])(index);   /* call each published exit function in turn */
}
where list values are added from the last entry of the list down toward the first entry, indexed by "offset", and then processed upward. This avoids any stack depth issues and can be quite fast.
Code generation processing
FIG. 4 is a flow diagram of a process for code generation. The process begins by parsing the grammar of the input language to create an AST or similar intermediate form for each production of the grammar, thereby representing the input language (block 401).
The process then constructs a graph for further analysis based on the production ASTs (block 403). The graph is constructed by traversing the grammar, starting with the start production and expanding non-terminals as they are encountered in the grammar (production references); a terminal is added to the graph when it is encountered; BLOCKs (groupings of alternatives) are constructed by inserting a BLOCK vertex that is expanded into paths (one for each alternative) that are merged at an inserted EOB vertex. Loops are processed by adding a loopback edge between the EOB node and the corresponding BLOCK node, and recursive productions are processed as previously described. Once the entire graph is completed, the recognizer generation process traverses the graph to identify decision points. Each decision point is used as a starting point for the decision process when it is encountered during traversal. It is checked whether there are remaining decision points to process (block 405), and if there are none, the process generates the parsing code (block 413) and stores the parsing code to storage (block 417). In some embodiments, the generated code may then be incorporated into executable code or similar code that may be used by a machine or human (block 415).
If not all decision points have been processed, the process retrieves the next decision point (block 407). The decision point is processed to generate an Intermediate Representation (IR) or code (block 409). The decision point token is patched to reference or inline the decision code (block 411); this causes the parsing code generated from the entire grammar representation to reference the recognition function. The traversal continues by checking whether there are still decision points to be processed (block 405) and, if so, processing the next decision point until all decision points have been processed.
FIG. 5 is a flow diagram of one embodiment of a decision point analysis process. The process begins by initializing the IR or inline code of the method (block 501). A list of alternatives is built, and code is then generated for alternatives that start with a symbol or set of symbols different from the symbols that start the other alternatives (block 503). The generated list is then added to the work queue (block 505). The work queue then begins processing each list in the queue until all lists are exhausted, at which point the decision point analysis ends (block 509). It is checked whether there are still lists to be processed (block 507), and when there are, the next list is selected and processed (block 511); other lists may be generated during processing. The processing of each list is described further below.
FIG. 6 is a flow diagram of one embodiment of a decision point analysis list process. List processing begins with retrieving the list from the work queue and comparing the next token type for each alternative (block 601). Any PUSH_CONTEXT nodes are then advanced past and the index values are added to the local stack (block 603). The list is then checked for inclusion of a POP_CONTEXT node (block 605). If a POP_CONTEXT node is included, the list is copied and POP_CONTEXT processing is performed before processing completes (block 607).
PUSH_CONTEXT and POP_CONTEXT nodes represent recursive productions. A recursive production becomes a double-loop structure in the grammar graph, as shown in figs. 3A to 3D. The second loop has the same number of iterations as the first loop; furthermore, when there are multiple alternatives (multiple recursive references), the selection of the second-loop alternative depends on the first-loop alternative. Thus PUSH_CONTEXT notes which alternative was taken in the first loop, and POP_CONTEXT retrieves the index value and selects the alternative for the second loop. There are two special cases in which one or the other loop disappears: left recursion and tail recursion. In a left-recursive instance there is nothing to match between the start of the production and the PUSH_CONTEXT node; in these cases it makes no sense to pop index values from the stack when processing the corresponding POP_CONTEXT node, because there is no way to determine how many index values should have been pushed onto the stack. Instead, the POP_CONTEXT node is treated as a simple loop back for that alternative. Similarly, for tail recursion there are no symbols to match in the loop from the POP_CONTEXT node (any index values on the stack are irrelevant), and no code need be generated for the PUSH_CONTEXT or POP_CONTEXT instance.
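As a hedged illustration of this double-loop behavior at runtime, consider the recursive production a : A a B | B a C | D from the earlier example; the Java sketch below (class, method and stub names are assumptions) shows the first loop pushing the chosen alternative and the second loop popping it to select the return path.

import java.util.*;

// Hypothetical runtime sketch of the PUSH_CONTEXT / POP_CONTEXT double loop
// for the production  a : A a B | B a C | D ;
class ContextLoopSketch {
    private final Deque<Integer> contextStack = new ArrayDeque<>();   // named stack for this production

    void ruleA() {
        boolean descending = true;
        while (descending) {                       // first loop: one iteration per recursive call
            switch (lookahead()) {
                case 'A': match('A'); contextStack.push(0); break;    // PUSH_CONTEXT, alternative 0
                case 'B': match('B'); contextStack.push(1); break;    // PUSH_CONTEXT, alternative 1
                default:  match('D'); descending = false;             // non-recursive alternative
            }
        }
        while (!contextStack.isEmpty()) {          // second loop: same number of iterations
            switch (contextStack.pop()) {          // POP_CONTEXT: the popped value selects the return path
                case 0: match('B'); break;         // return path of the first alternative
                case 1: match('C'); break;         // return path of the second alternative
            }
        }
    }

    int lookahead() { return 'D'; }                          // stub: peek at the next input symbol
    void match(char expected) { /* stub: consume and check the next input symbol */ }
}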
If the list does not include a POP_CONTEXT node, processing is performed for dropouts (the token type or set of token types that identifies a single path and therefore terminates the recognition analysis upon a match) (block 609). The list is then checked for multiple token types and/or decision points (block 611). If there are multiple token types and/or decision points, the process may elaborate the decision for each path (e.g., build a new list) (block 613), and then the decision points may be processed and the list split, if necessary, and added to the work queue before completion (block 625). At a decision point, the path list is compressed to remove empty paths and checked against the table of path lists (the list handling table) and stored context data. If the table contains an entry for a given path list, processing of the list ends and the reference to the list is replaced with a reference to the entry in the table. If not, an entry referencing the list is added to the table. This avoids infinite recursion in handling loopback decisions. Without multiple token types or decision points, the process checks for a SEMPRED (block 615), which represents a semantic predicate that tests context information to allow or disallow the alternative that follows. The semantic predicate is simply a Boolean test, so the list is divided into an "if true" list and an "else" list, and code is generated of the form: "if (sempred_condition) if_true(); else if_false();" (block 617). The "if true" list contains all current alternatives, including the alternative gated by the semantic predicate; the "if false" list omits the predicate-gated alternative. If there is no SEMPRED, a check is made to see whether all of the token types in the list match (block 619). If the tokens do not match, the list is split and the resulting lists are added to the work queue as discussed above (block 625). If all paths in the list refer to the same token type, matching code is generated for that token type (block 621). After this is done, processing continues with the next node (block 623). Otherwise, the list is broken down into lists by token type, each new list is added to the work queue, dispatch code for invoking the appropriate list function is generated for each matching token type, and list processing terminates.
Figs. 7 to 9 show the simplest version of the runtime recognition process (using the restricted GOTO model). Fig. 7 provides an overall framework, while fig. 8 shows the process of adding to the direction list, and fig. 9 shows the process of constructing the direction graph. Fig. 8 and 10 illustrate example code and runtime decisions.
With respect to FIG. 7, the recognition process ideally begins with a call to a recognition function (block 701). Within the function, tokens are matched (block 703) and the input stream is advanced to the next token until a decision point or a return call is reached. It is checked whether the input stream has been exhausted (block 705) and, if not, whether a decision point has been found (block 709). If the input is exhausted or a return call is reached, the exit function is invoked to end the analysis (block 707). A decision takes the form of an if...else...then or switch statement that selects the function to call and returns the index value (block 711). If the token type fails to match or is invalid at the decision point, an error code is returned or an exception is thrown. The return call starts the exit processing described in fig. 9 and fig. 11.
Runtime decisions reflect the code generated by the analysis process. The major runtime decisions that may be encountered include: 1) dropout switches; 2) semantic predicate (SEMPRED) if(...) conditional statements; 3) POP_CONTEXT loops; 4) switches over split lists; and 5) function calls (reflecting list merging/refinement).
Example Java code for a function with a dropout switch is shown below.
[Code listing reproduced as an image in the original publication; not shown here.]
The code matches A, advances the input stream, matches B, advances the input, and then reaches the dropout switch. If C, D or E subsequently matches, the endloop0 function is called with the appropriate index value to begin the exit process. If none match, -1 (an error code) is returned.
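A rough reconstruction consistent with this description is sketched below; since the original listing is only available as an image, the class name, token constants and helper methods are assumptions.

// Hypothetical reconstruction of the dropout-switch function described above.
class DropoutSwitchSketch {
    static final int A = 1, B = 2, C = 3, D = 4, E = 5;   // assumed token type constants

    int loop0() {
        if (!match(A) || !match(B)) return -1;  // match A, advance; match B, advance
        switch (lookahead()) {                  // the dropout switch
            case C: return endloop0(0);         // each dropout identifies a single path and
            case D: return endloop0(1);         // starts the exit process with its index
            case E: return endloop0(2);
            default: return -1;                 // error code: no alternative matches
        }
    }

    boolean match(int tokenType) { return true; }   // stub: consume and compare the next input token
    int lookahead() { return C; }                   // stub: peek at the next input token
    int endloop0(int index) { return index; }       // stub: exit processing (builds the direction list)
}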
The following example code shows a typical semantic predicate (baz() returns true or false).
[Code listing reproduced as an image in the original publication; not shown here.]
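Since the listing itself is an image, the following hedged sketch shows the general form such predicate code takes, based on the if-true/if-false split described in the decision-processing discussion above; baz() is named in the text, while the remaining names are assumptions.

// Hypothetical sketch of a semantic-predicate decision at runtime.
class SemanticPredicateSketch {
    int predDecision() {
        if (baz()) {              // semantic predicate: query context information
            return ifTrue();      // list containing all current alternatives, including the gated one
        } else {
            return ifFalse();     // list omitting the predicate-gated alternative
        }
    }

    boolean baz() { return true; }   // stub: the predicate named in the text
    int ifTrue()  { return 0; }      // stub: processing of the "if true" list
    int ifFalse() { return 0; }      // stub: processing of the "else" list
}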
FIG. 8 is a flow chart illustrating exit processing when PEG-type (or other) disambiguation is used. The nesting of calls from FIG. 7 is unwound to construct the list of directions and to manage the recursion stacks. The first step in the exit process is the call to an exit function (block 801). In the example code above, the exit function is endloop0_l(), called with an index value; the index value reflects a position in the parse path list and is used to select an action to perform (block 803). It is checked whether the action is to append to the direction list (block 805). If so, the process continues by adding a value to the list of directions (block 807), managing the recursion stacks, and so on, before returning a new index value for further processing if needed (block 809). When all calls have been processed (block 811), the process completes and the exit call ends.
The following code provides an example of the form of a simple Java exit function that corresponds to the matching function mentioned above.
[Code listing reproduced as images in the original publication; not shown here.]
Looking at the previous code, the 0th alternative ends with C and a call to endloop0_l with index 0. This results in an indexTrack.add(0) call; this function represents the elaboration of a secondary decision, so the "0" direction index is inserted into the direction list, and the index value is restored to the index from the pre-elaboration list. The D/1 case is similar, while the E case represents a held path from a previous decision. Thus, case 3 only affects the returned index value.
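A hedged reconstruction of what such an exit function might look like is given below; the original listing is an image, so the case layout, the restored index values, and the indexTrack representation are assumptions based only on the description above.

import java.util.*;

// Hypothetical sketch of the exit function described above (names and values assumed).
class ExitFunctionSketch {
    final List<Integer> indexTrack = new ArrayList<>();   // the direction list under construction

    int endloop0_l(int index) {
        switch (index) {
            case 0:                        // 0th alternative ended with C: a secondary decision was
                indexTrack.add(0);         // elaborated, so insert direction index 0 ...
                return 0;                  // ... and restore the index from the pre-elaboration list (assumed)
            case 1:                        // D case: similar to the C case
                indexTrack.add(1);
                return 0;
            default:                       // E case: a held path from a previous decision;
                return index - 2;          // only the returned index value is affected (value assumed)
        }
    }
}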
The code presented below provides a complete Java example of a recognizer and the grammar (a single production) from which the recognizer was generated.
Grammar
[Grammar listing reproduced as an image in the original publication; not shown here.]
Generated recognizer code
[Generated recognizer code reproduced as images in the original publication; not shown here.]
This particular example generates a recursive recognizer and has some points of interest. The top-level function recursionGenTest0() represents the decision that starts with the first "("; the endRecursionGenTest0() function constructs an index list entry for this decision, and recursionGenTest0_l() represents the two (BCD)+ loops and the E or F that follows them (note the contribution of endRecursionGenTest0_l() to the index list). The body of this function matches B C D, then matches E or F, and either loops back or makes recursive calls.
Corresponding parse function
[Parse function code reproduced as images in the original publication; not shown here.]
The parse code matches A and then evaluates the direction list before reaching a decision switch. The first case matches B, C and D before the loop decision point is reached, and the direction list is again checked. If the list is empty, the inline recognition code executes; E ends the loop, while B (the content of set 0) causes the loop to continue. The second case is similar, except that E is replaced by F.
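The following Java sketch approximates the shape of the parse function as described; because the original listing is an image, the control structure, helper names, and treatment of the direction list are assumptions.

import java.util.*;

// Hypothetical sketch of the described parse function (structure and names assumed).
class ParseFunctionSketch {
    private Deque<Integer> directions = new ArrayDeque<>();

    void parse() {
        match('A');
        if (directions.isEmpty()) directions = recognize();     // evaluate the direction list
        switch (directions.remove()) {                          // decision switch
            case 0:
                do {
                    match('B'); match('C'); match('D');
                    if (directions.isEmpty()) directions = recognize();  // inline recognition at the loop decision
                } while (directions.remove() == 0);              // set 0 (B follows) continues the loop
                match('E');                                      // E ends the loop
                break;
            case 1:
                do {
                    match('B'); match('C'); match('D');
                    if (directions.isEmpty()) directions = recognize();
                } while (directions.remove() == 0);
                match('F');                                      // second case: F instead of E
                break;
            default:
                throw new IllegalStateException("no viable alternative");
        }
    }

    private void match(char expected) { /* stub: consume and check the next input token */ }
    private Deque<Integer> recognize() { return new ArrayDeque<>(List.of(1)); }  // stub recognition call
}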
FIG. 9 is a flow diagram illustrating one embodiment of a generalized exit process, involving construction of a direction graph rather than the direction list of FIG. 8. The two processes are very similar, except that instead of processing a single return value, this process loops through the return list to construct a new return list. The process begins by calling an exit function with an index or list of nodes (block 901). It is checked whether the list is empty (block 903), and if so, the process ends after the final completion check (block 915). If an entry still exists in the list, the process obtains the next list entry and sets the node and index (block 907). An action is selected based on the index (block 905). If the action is to append to a direction, the process adds a direction list node entry, links it to the current node at that index (creating it if it is the first node), and sets the node value (block 911). An index or node is added to the return list (block 913). The list is again checked to determine whether it is empty (block 903); if not, the process continues with the next list entry, and otherwise the process ends when complete (block 915).
FIG. 10 is a diagram of one embodiment of a computer system to implement a parsing process. The computer system may include a processor 1001 or set of processors to execute a parser generator 1003 implementing the parsing process described herein. In another embodiment, the parser generator 1003 and the associated processing may be performed in a distributed manner across multiple computer systems in communication with each other. For clarity, embodiments are described below as being executed in a single computer system. However, those skilled in the art will understand that the principles and configurations described herein are consistent with other embodiments having other configurations (e.g., a distributed implementation).
In one embodiment, the parser generator 1003 includes a grammar processor 1005 front end for processing the input grammar into an AST, a linear generalized LL (LGLL; LL stands for a Left-to-right, Leftmost-derivation parse, i.e., a top-down parse) analysis engine 1007, and a code generator 1009 to create the recognizer and parser code; these divide the responsibilities of analyzing the input grammar and generating the code as described above. The grammar processor 1005 takes the grammar 1011 as input and generates the intermediate representation (AST) 1015 as described above. The LGLL analysis engine 1007 takes the AST as input and generates a graphical representation 1017 as described above. It then processes the graph as described in FIGS. 4-6, building and using the list handling table 1025 to avoid duplicate list analysis. The code generator 1009 processes the AST 1015 and the graphical representation 1017 using the functions described above to construct the generated parser 1019 (a source program to be integrated into a target application).
The grammar 1011, AST 1015, graphical representation 1017, list handling table 1025, and generated parser 1019 may be stored in the working memory 1021 of the computer system and may be accessed by the parser generator 1003 via a bus 1013 or similar interconnect. The processor 1001 may communicate over the bus 1013, a chip-level or system-area network, or a similar communication system with the working memory 1021, which stores the grammar 1011, the intermediate representation 1015, the graphical representation 1017 and the generated parser 1019. The working memory 1021 may be any type of storage device, such as solid-state random access memory. In addition to storing compiled code, the working memory 1021 may store any of the above data structures; the working memory 1021 and persistent storage (not shown) are responsible for storing executable code for the compiler and parser and their subcomponents.
The working memory 1021 may communicate with the processor 1001 through the bus 1013. However, those skilled in the art will appreciate that the bus 1013 does not strictly indicate that only a bus separates the processor 1001 from the working memory 1021, and the bus 1013 may include intermediate hardware, firmware, and software components that enable communication between the processor 1001 and the parser generator 1003. Those skilled in the art will appreciate that the computer system is provided by way of example, and not limitation, and that well-known structures and components of computer systems have been omitted for clarity.
In one embodiment, the parser generator and parser components implement the functionality described with reference to FIGS. 4-9 to generate an intermediate representation of code to be used in generating executable code or similar output. The parser may be implemented as part of the front-end compiler, distributed across multiple components of the compiler, or may be implemented separately from the compiler. In other embodiments, the parser is not used in the software compilation process, but is used in other types of code processing (e.g., DNA sequence processing or similar naturally occurring or artificially created data sequences). Those skilled in the art will appreciate that the processes and structures described herein with reference to software compilation may be adapted or suitable for processing other code for other environments.
FIG. 11 is a diagram of another computer system embodiment of the parsing process. A computer system comprising a processor 1101 executes a parser generator 1103 comprising a compiler 1105 and a linker 1107. The compiler may include a parser as described above, or may operate on the output of the parser and work with the linker to generate executable code. The compiler 1105 is separate from the parser generator and may be executed as a parallel process, especially for large programs. The compiler 1105 reads in the LGLL recognizer/parser (source code) and user source code and generates object code.
Subsequently, the linker 1107 is responsible for generating executable code 1119 containing the complete LGLL parser by operating on the object code 1115 and the libraries 1117. The linker 1107 links together the compiler-generated object code 1115 to create executable code 1119 that may run on a target platform. The linker 1107 combines the individual object code 1115 with library code 1117 and other object code or similar code to achieve platform-specific operation and execution of the source code.
Fig. 12 shows a diagrammatic representation of a machine in the exemplary form of a computer system within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed. In alternative embodiments, the machine may be connected (e.g., networked) to other machines in a Local Area Network (LAN), an intranet, an extranet, or the internet. The machine may operate in the capacity of a server or a client machine in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a Personal Computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any combination of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
An exemplary computer system includes a processing device 1202, a main memory 1204 (e.g., Read Only Memory (ROM), flash memory, Dynamic Random Access Memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM)), a static memory 1206 (e.g., flash memory, Static Random Access Memory (SRAM), etc.), and a secondary memory 1218 (e.g., a data storage device) that communicate with each other via a bus.
Processing device 1202 represents one or more general-purpose processing devices (e.g., a microprocessor, central processing unit, etc.). More specifically, the processing device may be a Complex Instruction Set Computing (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor implementing other instruction sets, or a processor implementing a combination of instruction sets. The processing device 1202 may also be one or more special-purpose processing devices such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a network processor, or the like. The processing device 1202 is configured to execute the compiler 1226 and/or parser for performing the operations and steps described herein.
The computer system 1200 may also include a network interface device 1208. The computer system may also include a video display unit 1210 (e.g., a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT)), an alphanumeric input device 1212 (e.g., a keyboard), a cursor control device 1214 (e.g., a mouse), and a signal generation device 1216 (e.g., a speaker).
The secondary memory 1218 may include a machine-readable storage medium 1228 (or more particularly, a non-transitory computer-readable storage medium) having stored thereon one or more sets of instructions embodying any one or more of the methodologies or functions described herein (e.g., the parser generator 1226). The parser generator 1226 (i.e., implementing the methods described herein) may also reside, completely or at least partially, within the main memory 1204 and/or within the processing device 1202 during execution thereof by the computer system 1200; the main memory 1204 and the processing device also constitute machine-readable storage media. The parser generator 1226 may also be transmitted or received over a network via the network interface device 1208.
The machine-readable storage medium 1228 (which may be a non-transitory computer-readable storage medium) may also be used to persistently store the modules of the parser generator 1226. While the non-transitory computer-readable storage medium is shown in an exemplary embodiment to be a single medium, the term "non-transitory computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "non-transitory computer-readable storage medium" shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine to cause the machine to perform any one or more of the methodologies of the present invention. The term "non-transitory computer readable storage medium" shall accordingly be taken to include, but not be limited to, solid-state memories and optical and magnetic media.
The computer system 1200 may also include a parser generator 1226 for implementing the functionality of the compilation process described above. The modules, components and other features described herein may be implemented as discrete hardware components or integrated within the functionality of hardware components such as ASICs, FPGAs, DSPs or similar devices. Additionally, the modules may be implemented as firmware or functional circuitry within a hardware device. Further, the modules may be implemented in any combination of hardware devices and software components.
Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the description, discussions utilizing terms such as "executing," "determining," "setting," "converting," "constructing," "traversing," or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Embodiments of the present invention also relate to apparatuses for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic disk storage media, optical storage media, flash memory devices, other types of machine-accessible storage media, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as described in the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. Although the present invention has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but may be practiced with modification and alteration within the spirit and scope of the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (21)

1. A computer-implemented grammar analysis method of generating code for runtime recognition to produce a graphical representation of a list or lists of directions to be followed by a given sentence during subsequent parsing, the method comprising the steps of:
parsing the grammar to create an Abstract Syntax Tree (AST) representation;
constructing a graph representing all features of the grammar for analysis, including recursion, alternation, grouping of alternatives, and loops;
processing each decision point in the graph to generate an intermediate representation, the decision point being a vertex having a plurality of output edges;
generating code for a recognition function that returns a list of directions used in runtime parsing decisions; and
patching each decision point marker to reference or inline the top-level recognition code of that decision point.
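By way of illustration only, the shape of this pipeline can be sketched in Python; the data structures and names below (Node, Graph, generate_recognizers) are hypothetical examples and are not part of the claimed implementation:

    # Hypothetical sketch: a decision point is any graph vertex with more
    # than one output edge; each one gets its own recognition routine.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Node:
        id: int
        token: str                                       # e.g. 'ID', 'BLOCK', 'PUSH_CONTEXT'
        edges: List[int] = field(default_factory=list)   # ids of successor nodes

    @dataclass
    class Graph:
        nodes: Dict[int, Node] = field(default_factory=dict)

        def decision_points(self) -> List[Node]:
            return [n for n in self.nodes.values() if len(n.edges) > 1]

    def generate_recognizers(graph: Graph) -> Dict[int, str]:
        # Placeholder for the per-decision-point analysis and code generation.
        return {dp.id: f"recognize_dp_{dp.id}()" for dp in graph.decision_points()}

    g = Graph({0: Node(0, "BLOCK", [1, 2]), 1: Node(1, "ID"), 2: Node(2, "NUM")})
    print(generate_recognizers(g))    # {0: 'recognize_dp_0()'}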
2. The computer-implemented method of claim 1, wherein the method further comprises: synchronizing the graph representing the grammar with the abstract syntax tree or other internal representation of the grammar used to generate the parser.
3. The computer-implemented method of claim 1, wherein grammar features are represented in the graph as nodes containing special token types or as nodes containing grammar-specific token types that represent terminals in the source grammar, the nodes containing special token types comprising: any of a POP_CONTEXT or PUSH_CONTEXT node representing stack manipulation for recursion management, wherein POP_CONTEXT is a decision node; any of a BLOCK or EOB node representing a grouping of alternatives, wherein the node is a decision point also present in the abstract syntax tree; and a SEMPRED node, also present in the abstract syntax tree, for implementing semantic predicates for context-aware recognition and parsing.
4. The computer-implemented method of claim 1, wherein the decision point analysis comprises the steps of:
initializing the intermediate representation or inline code of the method;
constructing a list of alternative paths starting at the decision point;
generating code for singletons and removing terminated paths from the list;
adding the path list to a work queue;
processing the path list to generate runtime recognition code and to generate other entries in the work queue; and
processing the path lists in the work queue until the work queue is empty.
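As a rough, non-authoritative illustration of this control flow, a work queue of path lists can be drained as follows; the path representation (alternative index paired with remaining tokens) and all names are invented for the example:

    from collections import deque

    def analyze_decision(alternatives):
        # A "path" pairs the alternative index with the tokens not yet examined.
        path_list = [(i, tuple(toks)) for i, toks in enumerate(alternatives)]
        code, queue = [], deque([path_list])
        while queue:                                  # drain the work queue
            paths = queue.popleft()
            by_token = {}
            for alt, toks in paths:
                by_token.setdefault(toks[0], []).append((alt, toks[1:]))
            for tok, group in by_token.items():
                if len(group) == 1:                   # singleton: decision resolved
                    code.append(f"on {tok!r} choose alternative {group[0][0]}")
                else:                                 # still ambiguous: new path list
                    code.append(f"on {tok!r} continue")
                    queue.append(group)
        return code

    print(analyze_decision([["ID", "=", "EXPR"], ["ID", "(", "ARGS"]]))
    # ["on 'ID' continue", "on '=' choose alternative 0", "on '(' choose alternative 1"]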
5. The computer-implemented method of claim 4, wherein processing the path list comprises the steps of:
obtaining a path list from a work queue;
stepping through the path list, wherein each step includes processing of a special token type and/or a grammar-specific token type;
as a node is processed, advancing from a current node to a next node in the path list;
terminating path list processing when a new path list is created and added to the work queue, or is replaced by a previously encountered equivalent; and
saving the initial path list values in a list-handling table.
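One plausible reading of this termination rule is memoization: the initial value of each path list is recorded in a table, and a newly created list that equals one already recorded is replaced by a reference to the earlier result. A toy sketch under that assumption (all names invented):

    def process_path_lists(initial, step):
        handling_table = {}                  # initial path-list value -> label
        queue, counter = [initial], 0
        while queue:
            paths = queue.pop(0)
            key = tuple(paths)               # the saved initial path-list value
            if key in handling_table:
                continue                     # equivalent list seen before: reuse it
            handling_table[key] = f"recognizer_{counter}"
            counter += 1
            queue.extend(step(paths))        # stepping may create new path lists
        return handling_table

    # Example step: advance each path one token, dropping exhausted paths.
    def step(paths):
        nxt = [(alt, toks[1:]) for alt, toks in paths if len(toks) > 1]
        return [nxt] if nxt else []

    print(process_path_lists([(0, ("A", "B")), (1, ("A", "C"))], step))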
6. The computer-implemented method of claim 4, wherein processing the path list further comprises the steps of:
checking whether the path list has a PUSH_CONTEXT node; and
advancing past the PUSH_CONTEXT node and adding the index value to the path-local stack.
7. The computer-implemented method of claim 4, wherein processing the path list further comprises the steps of:
checking whether the path list has a POP_CONTEXT node; and
copying the path list and performing POP_CONTEXT processing when the path list includes a POP_CONTEXT node.
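Read together, claims 6 and 7 describe a path-local stack: PUSH_CONTEXT records an index for later resumption, and POP_CONTEXT works on a copy of the path, popping that index to decide where to continue. A hypothetical sketch, with invented data shapes:

    import copy
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Path:
        node: int                                        # current graph node id
        stack: List[int] = field(default_factory=list)   # path-local context stack

    def advance(path, kind, return_index=None, next_node=None):
        if kind == "PUSH_CONTEXT":
            path.stack.append(return_index)              # record where to resume
            path.node = next_node
            return path
        if kind == "POP_CONTEXT":
            branched = copy.deepcopy(path)               # operate on a copy of the path
            branched.node = branched.stack.pop()         # resume at the recorded index
            return branched
        path.node = next_node
        return path

    p = advance(Path(node=0), "PUSH_CONTEXT", return_index=7, next_node=3)
    print(advance(p, "POP_CONTEXT"))                     # Path(node=7, stack=[])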
8. The computer-implemented method of claim 4, wherein processing the path list further comprises the steps of:
checking whether the path list has a plurality of token types and/or decision points; and
refining the decision points to create new path lists and splitting the path list when multiple grammar-specific token types are represented.
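The split can be pictured as partitioning the path list by the grammar token type at the head of each path, each partition becoming a new path list. A toy example (invented data shapes):

    from collections import defaultdict

    def split_by_token(paths):
        groups = defaultdict(list)                   # token type -> new path list
        for alt, toks in paths:
            groups[toks[0]].append((alt, toks[1:]))
        return dict(groups)

    print(split_by_token([(0, ("ID", "=")), (1, ("ID", "(")), (2, ("NUM",))]))
    # {'ID': [(0, ('=',)), (1, ('(',))], 'NUM': [(2, ())]}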
9. The computer-implemented method of claim 4, wherein processing the path list further comprises the steps of:
checking for a SEMPRED token representing a semantic predicate for runtime, context-aware recognition and parsing; and
processing the path list to generate code for SEMPRED-directed runtime decisions.
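A SEMPRED-guarded decision defers to a runtime predicate, which is what makes the recognition context-aware. The following is only a sketch of the shape such generated code might take; the predicate and the parser-state dictionary are made up for the example:

    def make_recognizer(predicate, alt_if_true, alt_if_false):
        def recognize(parser_state):
            # The predicate inspects parser context at runtime.
            return alt_if_true if predicate(parser_state) else alt_if_false
        return recognize

    def is_type_name(state):
        return state["symbol"] in state["types"]

    recognize = make_recognizer(is_type_name, alt_if_true=0, alt_if_false=1)
    print(recognize({"symbol": "size_t", "types": {"size_t"}}))   # 0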
10. The computer-implemented method of claim 4, wherein processing the path list further comprises the steps of:
checking whether all token types are matched; and
generating matching code for each token type.
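Once every remaining token type selects a single alternative, the generator can emit one match arm per token type. A purely illustrative emitter, whose output syntax is arbitrary and not taken from the patent:

    def emit_match_code(token_to_alt):
        lines = ["switch (lookahead):"]
        for tok, alt in token_to_alt.items():
            lines.append(f"  case {tok}: return directions_for_alt_{alt}")
        return "\n".join(lines)

    print(emit_match_code({"ID": 0, "NUM": 1, "LPAREN": 2}))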
11. A non-transitory machine-readable medium having stored therein a set of instructions which, when executed by a computer system, cause the computer system to perform a method of grammar analysis to generate code for runtime recognition to produce a graphical representation of a list or lists of directions to be followed by a given statement during subsequent parsing, the executed instructions causing the computer system to perform operations comprising:
parsing the grammar to create an Abstract Syntax Tree (AST) representation;
constructing a graph representing all features of the grammar for analysis, including recursion, alternation, grouping of alternatives, and loops;
processing each decision point in the graph to generate an intermediate representation, the decision point being a vertex having a plurality of output edges;
generating code for a recognition function that returns a list of directions used in runtime parsing decisions; and
patching each decision point marker to reference or inline the top-level recognition code of that decision point.
12. The non-transitory machine-readable medium of claim 11, wherein execution of the instructions results in the computer system performing operations further comprising: synchronizing the graph representing the grammar with the abstract syntax tree or other internal representation of the grammar used to generate the parser.
13. The non-transitory machine-readable medium of claim 11, wherein grammar features are represented in the graph as nodes containing special token types or as nodes containing grammar-specific token types that represent terminals in the source grammar, the nodes containing special token types comprising: any of a POP_CONTEXT or PUSH_CONTEXT node representing stack manipulation for recursion management, wherein POP_CONTEXT is a decision node; any of a BLOCK or EOB node representing a grouping of alternatives, wherein the node is a decision point also present in the abstract syntax tree; and a SEMPRED node, also present in the abstract syntax tree, for implementing semantic predicates for context-aware recognition and parsing.
14. The non-transitory machine-readable medium of claim 11, wherein the decision point analysis comprises the steps of:
initializing the intermediate representation or inline code of the method;
constructing a list of alternative paths starting at the decision point;
generating code for singletons and removing terminated paths from the list;
adding the path list to a work queue;
processing the path list to generate runtime recognition code and to generate other entries in the work queue; and
processing the path lists in the work queue until the work queue is empty.
15. The non-transitory machine-readable medium of claim 14, wherein processing the path list comprises the steps of:
obtaining a path list from a work queue;
stepping through the path list, wherein each step includes processing of a special token type and/or a grammar-specific token type;
as a node is processed, advancing from a current node to a next node in the path list;
terminating path list processing when a new path list is created and added to the work queue, or is replaced by a previously encountered equivalent; and
saving the initial path list values in a list-handling table.
16. The non-transitory machine-readable medium of claim 14, wherein processing the path list further comprises the steps of:
checking whether the path list has a PUSH_CONTEXT node; and
advancing past the PUSH_CONTEXT node and adding the index value to the path-local stack.
17. The non-transitory machine-readable medium of claim 14, wherein processing the path list further comprises the steps of:
checking whether the path list has a POP_CONTEXT node; and
copying the path list and performing POP_CONTEXT processing when the path list includes a POP_CONTEXT node.
18. The non-transitory machine-readable medium of claim 14, wherein processing the path list further comprises the steps of:
checking whether the path list has a plurality of token types and/or decision points; and
refining the decision points to create new path lists and splitting the path list when multiple grammar-specific token types are represented.
19. The non-transitory machine-readable medium of claim 14, wherein processing the path list further comprises the steps of:
checking for a SEMPRED token representing a semantic predicate for runtime, context-aware recognition and parsing; and
processing the path list to generate code for SEMPRED-directed runtime decisions.
20. The non-transitory machine-readable medium of claim 14, wherein processing the path list further comprises the steps of:
checking whether all token types are matched; and
generating matching code for each token type.
21. A computer system configured to implement a method of grammar analysis to generate code for runtime recognition to produce a graphical representation of a list or lists of directions to be followed by a given statement during subsequent parsing, the computer system comprising:
a non-transitory machine-readable medium storing source code and a parser generator; and
a processor coupled to the non-transitory machine-readable medium, the processor configured to execute the parser generator, the parser generator configured to: parse the grammar to create an Abstract Syntax Tree (AST) representation; construct a graph representing all features of the grammar for analysis, including recursion, alternation, grouping of alternatives, and loops; process each decision point in the graph to generate an intermediate representation, the decision point being a vertex having a plurality of output edges; generate code for a recognition function that returns a list of directions used in runtime parsing decisions; and patch each decision point marker to reference or inline the top-level recognition code of each decision point.
CN201580037492.4A 2014-07-11 2015-07-10 Method and system for linear generalized LL recognition and context-aware parsing Active CN106663094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010289622.6A CN111522554A (en) 2014-07-11 2015-07-10 Method and system for linear generalized LL recognition and context-aware parsing

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201462023771P 2014-07-11 2014-07-11
US62/023,771 2014-07-11
PCT/US2015/040049 WO2016007923A1 (en) 2014-07-11 2015-07-10 Method and system for linear generalized ll recognition and context-aware parsing

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202010289622.6A Division CN111522554A (en) 2014-07-11 2015-07-10 Method and system for linear generalized LL recognition and context-aware parsing

Publications (2)

Publication Number Publication Date
CN106663094A CN106663094A (en) 2017-05-10
CN106663094B true CN106663094B (en) 2020-03-27

Family

ID=58495569

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201580037492.4A Active CN106663094B (en) 2014-07-11 2015-07-10 Method and system for linear generalized LL recognition and context-aware parsing
CN202010289622.6A Pending CN111522554A (en) 2014-07-11 2015-07-10 Method and system for linear generalized LL recognition and context-aware parsing

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010289622.6A Pending CN111522554A (en) 2014-07-11 2015-07-10 Method and system for linear generalized LL recognition and context-aware parsing

Country Status (2)

Country Link
EP (1) EP3167382A4 (en)
CN (2) CN106663094B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692600B (en) * 2019-02-19 2023-04-18 洛林·G·克雷默三世 Method and system for formal language processing using subroutine graph
CN111832736B (en) * 2019-04-19 2024-04-12 伊姆西Ip控股有限责任公司 Method, apparatus and computer readable storage medium for processing machine learning model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089541B2 (en) * 2001-11-30 2006-08-08 Sun Microsystems, Inc. Modular parser architecture with mini parsers
WO2011015222A1 (en) * 2009-07-15 2011-02-10 Proviciel - Mlstate System and method for creating a parser generator and associated computer program
CN102171679A * 2008-10-03 2011-08-31 Microsoft Corporation Tree-based directed graph programming structures for a declarative programming language
CN102202242A * 2011-05-19 2011-09-28 Guangdong Xinghai Digital Home Industry Technology Research Institute Co., Ltd. Realization method of JavaScript interpreter based on set-top box browser

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100037213A1 (en) * 2008-08-07 2010-02-11 Microsoft Corporation Grammar-based generation of types and extensions
EP2915068A4 (en) * 2012-11-02 2016-08-03 Fido Labs Inc Natural language processing system and method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089541B2 (en) * 2001-11-30 2006-08-08 Sun Microsystems, Inc. Modular parser architecture with mini parsers
CN102171679A * 2008-10-03 2011-08-31 Microsoft Corporation Tree-based directed graph programming structures for a declarative programming language
WO2011015222A1 (en) * 2009-07-15 2011-02-10 Proviciel - Mlstate System and method for creating a parser generator and associated computer program
CN102202242A * 2011-05-19 2011-09-28 Guangdong Xinghai Digital Home Industry Technology Research Institute Co., Ltd. Realization method of JavaScript interpreter based on set-top box browser

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Grammar-Based Model Transformations; Galina Besova et al.; 2014 Federated Conference on Computer Science and Information Systems; 2014-09-10; pp. 1601-1610 *
Parsing Abstract Syntax Graphs with ModelCC; Luis Quesada et al.; 2014 2nd International Conference on Model-Driven Engineering and Software Development (MODELSWARD); 2014-01-09; pp. 1-7 *
Analysis and Design of an ANTLR-Based Lexer and Parser for Gaussian; Liu Sanxian; China Master's Theses Full-text Database, Information Science and Technology; 2009-12-15 (No. 12); p. I138-1026 *

Also Published As

Publication number Publication date
CN111522554A (en) 2020-08-11
EP3167382A1 (en) 2017-05-17
CN106663094A (en) 2017-05-10
EP3167382A4 (en) 2018-03-14

Similar Documents

Publication Publication Date Title
US10664655B2 (en) Method and system for linear generalized LL recognition and context-aware parsing
CN108139891B (en) Method and system for generating suggestions to correct undefined token errors
US20030074184A1 (en) Chart parsing using compacted grammar representations
WO2007144853A2 (en) Method and apparatus for performing customized paring on a xml document based on application
US9311058B2 (en) Jabba language
US20130152061A1 (en) Full fidelity parse tree for programming language processing
Adams Principled parsing for indentation-sensitive languages: revisiting landin's offside rule
CN113508385B (en) Method and system for formal language processing using subroutine graph
Fedorchenko et al. Equivalent transformations and regularization in context-free grammars
US20200272443A1 (en) Code completion with machine learning
Bleys et al. Search in linguistic processing
CN115509514A (en) Front-end data simulation method, device, equipment and medium
CN106663094B (en) Method and system for linear generalized LL recognition and context-aware parsing
Scott et al. Derivation representation using binary subtree sets
Stansifer et al. Parsing reflective grammars
Scott et al. GLL syntax analysers for EBNF grammars
Omar et al. Reasonably programmable literal notation
Johnstone et al. Evaluating GLR parsing algorithms
EP3255558A1 (en) Syntax analyzing device, learning device, machine translation device and recording medium
KR102146625B1 (en) Apparatus and method for computing incrementally infix probabilities based on automata
More Compiler construction
Begel et al. XGLR—an algorithm for ambiguity in programming languages
Kwiatkowski Reconciling Unger’s parser as a top-down parser for CF grammars for experimental purposes
Handzhiyski et al. Tunnel Parsing with Ambiguous Grammars
Musca et al. Technical Report: Match-reference regular expressions and lenses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant