CN111381826A

CN111381826A - Method and device for generating syntax tree of code file and electronic equipment

Info

Publication number: CN111381826A
Application number: CN201811638890.3A
Authority: CN
Inventors: 冯刚
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2020-07-07

Abstract

The application relates to the field of computer software development and discloses a method, a device and electronic equipment for generating a syntax tree of a code file. The method comprises: when a code file to be analyzed, written in a preset programming language, is received, analyzing each lexical symbol in the code file through a lexical analysis module and generating a corresponding linear linked list; analyzing each lexical symbol in the linear linked list in sequence based on a first lookup table and a second lookup table, and performing corresponding conflict elimination processing on any lexical symbol determined to belong to a preset conflict type; and generating a syntax tree of the code file to be analyzed according to the processing result of the conflict elimination processing. With the method of the embodiment of the application, a written code file can be statically analyzed through the syntax tree, so that syntax errors, writing errors and the like in the code file can be checked and corrected accurately and efficiently.

Description

Method and device for generating syntax tree of code file and electronic equipment

Technical Field

The application relates to the technical field of computer development, in particular to a method and a device for generating a syntax tree of a code file and electronic equipment.

Background

In the current computer field, compiler technology for high-level programming languages such as C, C++ and Java is increasingly mature, and such compilers can convert a program language into a machine language. However, current compilers cannot comprehensively analyze, one by one, the code writing errors in a code file, and cannot accurately prompt or correct them. Program developers are therefore required to manually check for errors in the code file before compiling it.

The inventor of the application finds that when the amount of code in a code file is large, this manual checking imposes a great deal of work on the program developer, who spends a great amount of time and energy checking errors in the code file, so the checking efficiency is extremely low. The inventor also finds that by searching a corresponding lookup table according to the part of speech of each lexical symbol in the code file and generating a syntax tree of the code file, syntax errors, writing errors and the like in the code file can be checked and corrected automatically, greatly improving the checking efficiency.

Disclosure of Invention

The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:

In a first aspect, a method for generating a syntax tree of a code file is provided, which includes:

when a code file to be analyzed of a preset programming language is received, analyzing each lexical symbol in the code file to be analyzed through a lexical analysis module and generating a corresponding linear linked list;

analyzing each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, and performing corresponding conflict elimination processing on any lexical symbol when determining that any lexical symbol belongs to a preset conflict type;

and generating a syntax tree of the code file to be analyzed according to the processing result of the conflict elimination processing.

Specifically, the predetermined conflict type includes any one of:

a conflict between shift processing and reduction processing for the lexical symbol (a shift-reduce conflict);

a conflict between first reduction processing and second reduction processing for the lexical symbol (a reduce-reduce conflict).

Further, before performing conflict elimination processing on any lexical symbol, the method further includes:

saving the current processing state to obtain a first saved result.

Further, performing conflict elimination processing on any lexical symbol includes:

performing first target processing on the lexical symbol according to the context, and sequentially performing corresponding processing on the lexical symbols following it based on the first target processing;

and if no processing error occurs by the time the retry end symbol has been processed, deleting the first saved result and continuing to process the subsequent lexical symbols.

Further, the method also includes:

if a processing error occurs while the lexical symbols following the lexical symbol are being processed in sequence, performing recovery processing according to the first saved result, performing second target processing on the lexical symbol, and sequentially processing the lexical symbols following it.

Further, performing conflict elimination processing on any lexical symbol includes any one of the following cases:

when the lexical symbol belongs to a first preset type and conflicts between reduction processing and shift processing, performing shift processing on the lexical symbol;

when the lexical symbol belongs to a second preset type and conflicts between shift processing and reduction processing, determining whether to perform shift processing or reduction processing on the lexical symbol according to the linear relation between the lexical symbols;

and when the lexical symbol belongs to a third preset type and conflicts between first reduction processing and second reduction processing, determining whether to perform the first reduction processing or the second reduction processing on the lexical symbol according to the linear relation between the lexical symbols.

Further, after performing conflict elimination processing on any lexical symbol, the method further includes:

performing error recovery processing on the processing result of the conflict elimination processing.

Further, analyzing each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, including:

sequentially determining the part of speech of each lexical symbol in the linear linked list, and, once the part of speech of any lexical symbol is determined, searching the first lookup table and the second lookup table according to that part of speech to obtain a corresponding search result;

and analyzing any lexical symbol according to the search result.

Further, the predetermined programming language is either the C++ programming language or the C programming language.

In a second aspect, an apparatus for generating a syntax tree of a code file is provided, including:

the analysis module is used for analyzing each lexical symbol in the code file to be analyzed through the lexical analysis module and generating a corresponding linear linked list when the code file to be analyzed of a preset programming language is received;

the first processing module is used for analyzing each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, and performing corresponding conflict elimination processing on any lexical symbol when determining that any lexical symbol belongs to a preset conflict type;

and the syntax tree generating module is used for generating the syntax tree of the code file to be analyzed according to the processing result of the conflict elimination processing.

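Specifically, the predetermined conflict type includes any one of the conflict types set out in the first aspect, namely a shift-reduce conflict or a reduce-reduce conflict.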

Further, the device also comprises a storage module;

and the storage module is used for storing the current processing state to obtain a first storage result.

Further, the first processing module comprises a first processing submodule and a second processing submodule;

the first processing submodule is used for carrying out first target processing on any lexical symbol according to the context and sequentially carrying out corresponding processing on the lexical symbols behind any lexical symbol on the basis of the first target processing;

and the second processing submodule is used for deleting the first storage result and continuing to process the subsequent lexical symbols when no processing error has occurred by the time the retry end symbol has been processed.

Further, the first processing module comprises a third processing submodule;

and the third processing submodule is used for performing recovery processing according to the first storage result when processing errors occur in the process of sequentially performing corresponding processing on the lexical symbols after any lexical symbol, performing second target processing on any lexical symbol, and sequentially performing corresponding processing on the lexical symbols after any lexical symbol.

Further, the device also comprises a second processing module;

and the second processing module is used for carrying out error recovery processing on the processing result of the conflict elimination processing.

Further, the first processing module comprises a part of speech determining submodule and an analyzing submodule;

the part of speech determining submodule is used for sequentially determining the part of speech of each lexical symbol in the linear linked list and, once the part of speech of any lexical symbol is determined, searching the first lookup table and the second lookup table according to that part of speech to obtain a corresponding search result;

and the analysis submodule is used for analyzing any lexical symbol according to the search result.

In a third aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the above method for generating a syntax tree of a code file.

In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the above-described method of generating a syntax tree for a code file.

According to the method for generating the syntax tree of the code file provided by the embodiment of the application, each lexical symbol in the linear linked list is analyzed in sequence based on the first lookup table and the second lookup table, and when any lexical symbol is determined to belong to a preset conflict type, corresponding conflict elimination processing is performed on it. This resolves the situation in which several possible processing behaviors are found in the lookup tables and it cannot be determined which one to select, and lays the necessary basis for subsequently generating the syntax tree of the code file to be analyzed. The syntax tree of the code file to be analyzed is then generated according to the processing result of the conflict elimination processing, so that a written code file can be statically analyzed through the syntax tree, syntax errors, writing errors and the like in the code file can be checked and corrected accurately and efficiently, and the time and energy of program developers are greatly saved.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart illustrating a method for generating a syntax tree of a code file according to an embodiment of the present application;

FIG. 2 is a diagram illustrating a basic process of conflict resolution processing according to an embodiment of the present application;

FIG. 3 is a basic process diagram of a part of speech determination process according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a basic process of obtaining parts of speech of keywords and operators in the part of speech determination process according to the embodiment of the present application;

FIG. 5 is a schematic diagram illustrating a process of determining parts of speech by scope information search according to an embodiment of the present application;

FIG. 6 is a diagram illustrating a basic procedure of part-of-speech guessing according to an embodiment of the present application;

FIG. 7 is a diagram illustrating a basic process of error recovery processing according to an embodiment of the present application;

FIG. 8 is a diagram illustrating an overall process of generating a syntax tree of a code file according to an embodiment of the present application;

FIG. 9 is a flow chart illustrating a shift-in process according to an embodiment of the present application;

FIG. 10 is a diagram illustrating a basic structure of an apparatus for generating a syntax tree of a code file according to an embodiment of the present application;

FIG. 11 is a detailed structural diagram of an apparatus for generating a syntax tree of a code file according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Example one

The embodiment of the present application provides a method for generating a syntax tree of a code file, as shown in fig. 1, including:

Step S110, when a code file to be parsed in a predetermined programming language is received, parsing each lexical symbol in the code file to be parsed by the lexical analysis module and generating a corresponding linear linked list.

Specifically, a code file of the predetermined programming language is generally composed of code elements such as keywords, identifiers, mathematical operators, scope identifiers and statement punctuation, which are also called lexical symbols (tokens). The lexical analysis module combines the lexical symbols of the code file to be parsed, in order, into a linear linked list (token-list), which is the precondition for subsequently generating the syntax tree of the code file.

Hereinafter, symbols such as mathematical operators, scope identifiers and statement punctuation are collectively referred to as operators, and the first element of the linked list is referred to as token-list.

Further, the basic attributes of the lexical symbols (tokens) are shown in Table 1 below, where the left column gives the English name corresponding to each lexical symbol and the right column gives the corresponding Chinese description or explanation.

TABLE 1 basic Properties of lexical symbols (tokens)

In some cases, in order to simplify the grammar, some lexical symbols (tokens) may be merged during lexical parsing. For example, each of "::new", "::delete", ".template" and "->template" is merged into a single lexical symbol (token).
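The lexical step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lexical analysis module of the application: the `Token` class, the `tokenize` and `to_list` helpers and the regular expression are all hypothetical, and only a small subset of C++ tokens is recognized. It shows tokens being chained into a singly linked list, with sequences such as `::new` merged into one token.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    next: Optional["Token"] = None  # link to the following token


# Merged forms come first so that "::new" wins over "::" followed by "new".
# The operator set is deliberately small (an assumption of this sketch).
TOKEN_RE = re.compile(
    r"::\s*new\b|::\s*delete\b|\.\s*template\b|->\s*template\b"
    r"|[A-Za-z_]\w*|\d+|::|->|&&|[{}()\[\];:,.*&~=<>+\-]"
)


def tokenize(source: str) -> Optional[Token]:
    """Return the head of a singly linked list of tokens (the token-list)."""
    head = tail = None
    for match in TOKEN_RE.finditer(source):
        text = re.sub(r"\s+", "", match.group())  # ":: new" becomes "::new"
        node = Token(text)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head


def to_list(head: Optional[Token]) -> list:
    """Walk the linked list and collect the token texts, for inspection."""
    out = []
    while head is not None:
        out.append(head.text)
        head = head.next
    return out
```

For instance, `to_list(tokenize("int *p = ::new int(0);"))` yields ten tokens, with `::new` appearing as a single entry.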

Step S120, analyzing each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, and performing corresponding conflict elimination processing on any lexical symbol determined to belong to a preset conflict type.

Specifically, when a conflict occurs in the process of searching the first lookup table and the second lookup table according to the part of speech of the lexical symbol, that is, when multiple possible processing behaviors are searched at the same time and which behavior should be selected cannot be determined, the predetermined conflict type corresponding to the current lexical symbol needs to be determined, so that the conflict elimination processing is performed on the current lexical symbol according to the determined predetermined conflict type.

Step S130, according to the processing result of the conflict elimination processing, generating a syntax tree of the code file to be analyzed.

Specifically, after the predetermined conflict type of the lexical symbol is eliminated, the syntax tree of the code file to be parsed may be generated according to a processing result after the conflict elimination processing.
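Steps S120 and S130 can be illustrated with a small table-driven parser. The sketch below is only an assumption-laden illustration: the toy grammar (E → E + n | n), the hand-written `ACTION` and `GOTO` tables, and the `resolve` callback are all hypothetical, and treating the second lookup table as a Goto table is an assumption of this sketch. An Action entry holding more than one candidate stands for a conflict, which is handed to `resolve`.

```python
# Productions of the toy grammar: number -> (left-hand side, right-hand length).
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1)}  # 1: E -> E + n, 2: E -> n

# First lookup table: (state, terminal) -> list of candidate actions.
ACTION = {
    (0, "n"): [("shift", 2)],
    (1, "+"): [("shift", 3)],
    (1, "$"): [("accept", None)],
    (2, "+"): [("reduce", 2)],
    (2, "$"): [("reduce", 2)],
    (3, "n"): [("shift", 4)],
    (4, "+"): [("reduce", 1)],
    (4, "$"): [("reduce", 1)],
}
# Second lookup table (assumed here to be a Goto table): (state, non-terminal) -> state.
GOTO = {(0, "E"): 1}


def parse(tokens, resolve=lambda state, tok, candidates: candidates[0]):
    """Parse a token sequence and return a syntax tree of (lhs, children) nodes."""
    states, symbols = [0], []
    stream = list(tokens) + ["$"]
    pos = 0
    while True:
        tok = stream[pos]
        candidates = ACTION.get((states[-1], tok))
        if not candidates:
            raise SyntaxError(f"unexpected {tok!r} in state {states[-1]}")
        if len(candidates) > 1:
            # Several candidate actions: a shift-reduce or reduce-reduce conflict.
            kind, arg = resolve(states[-1], tok, candidates)
        else:
            kind, arg = candidates[0]
        if kind == "shift":
            states.append(arg)
            symbols.append(tok)
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            children = symbols[-length:]
            del symbols[-length:]
            del states[-length:]
            symbols.append((lhs, children))
            states.append(GOTO[(states[-1], lhs)])
        else:  # accept
            return symbols[0]
```

The toy tables contain no conflicts, so `resolve` is never called here; it only marks the point at which conflict elimination processing would take over.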

Compared with the prior art, in the method for generating the syntax tree of the code file provided by the embodiment of the application, each lexical symbol in the linear linked list is analyzed in sequence based on the first lookup table and the second lookup table, and when any lexical symbol is determined to belong to a preset conflict type, corresponding conflict elimination processing is performed on it. This resolves the situation in which several possible processing behaviors are found in the lookup tables and it cannot be determined which one to select, and lays the necessary basis for subsequently generating the syntax tree of the code file to be analyzed. The syntax tree of the code file to be analyzed is then generated according to the processing result of the conflict elimination processing, so that a written code file can be statically analyzed through the syntax tree, syntax errors, writing errors and the like in the code file can be checked and corrected accurately and efficiently, and the time and energy of program developers are greatly saved.

The embodiment of the present application provides a possible implementation manner, wherein:

the predetermined programming language is either the C++ programming language or the C programming language. The following description takes the C++ programming language as an example; the processing procedure for the C programming language is the same as that for the C++ programming language, that is, the method provided in the embodiment of the present application is compatible with both the C programming language and the C++ programming language.

Specifically, the predetermined conflict type is either a shift-reduce conflict or a reduce-reduce conflict, as set out in the first aspect.

Specifically, before conflict elimination processing is performed on any lexical symbol, the current processing state is saved to obtain a first saved result.

Specifically, the conflict elimination processing of any lexical symbol comprises the first target processing and, if a processing error occurs, the recovery processing and second target processing set out in the first aspect, applied according to which of the three preset types the lexical symbol belongs to.

The conflict elimination processing is described in detail below:

Because of these conflicts, the production list adopted by the embodiment of the application is not a classic LR(1) grammar; however, a standard LR(1) grammar for the C++ programming language is difficult to construct, so the embodiment of the application implements a conflict elimination processing module.

For example, in the production list file, the function parameter list is part of the non-terminal declarator, and a declarator may be followed by an initialization list, which may also start with "(", such as:

int a (0); // declares variable a with an initial value of 0; "(0)" is an initialization list

int a (int); // declares function a, which has one parameter of type int; "(int)" is a parameter list

This creates a conflict. After the two lexical symbols int and a are read in, the symbol at the top of the stack is noptr-declarator. When "(" is then read in, it may be the start of a parameter list or the start of an initialization list; that is, it is uncertain whether to reduce the top symbol to ptr-declarator or to shift and go on to match a parameter list. Whether "(" is followed by a parameter list or an initialization list must therefore be decided from the context. In some cases a definite conclusion can be drawn from the linear relationship between symbols; in others it cannot. When it cannot, parsing is first attempted as a parameter list: if this passes (no parsing error occurs), the construct is decided to be a parameter list; if a parsing error occurs, it is parsed as an initialization list. That is, a "retry mechanism" is provided.
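The decision described above, between a parameter list and an initialization list, can be sketched as a lookahead check on the tokens that follow "(". This is a hypothetical heuristic written for illustration only; the function name `classify_paren`, the `known_types` argument and the specific rules are assumptions and do not come from the application. A return value of `None` stands for the case in which no definite conclusion can be drawn and the retry mechanism would be needed.

```python
BUILTIN_TYPES = {"void", "bool", "char", "short", "int", "long", "float",
                 "double", "signed", "unsigned", "const", "volatile"}


def classify_paren(tokens, open_index, known_types=frozenset()):
    """Guess what the "(" at tokens[open_index] starts.

    Returns "parameter-list", "initialization-list", or None when the
    tokens right after "(" do not settle the question.
    """
    types = BUILTIN_TYPES | set(known_types)
    first = tokens[open_index + 1]
    if first == ")":
        return "parameter-list"  # e.g. `int a();`
    if first in types:
        # A type followed by "(" may be a cast such as `int(0)`: undecided.
        return None if tokens[open_index + 2] == "(" else "parameter-list"
    if first[0].isdigit() or first[0] in "\"'":
        return "initialization-list"  # a literal cannot start a parameter
    return None  # unknown identifier: needs more context or a retry
```

With the two declarations above, `classify_paren(["int", "a", "(", "0", ")", ";"], 2)` gives "initialization-list" and `classify_paren(["int", "a", "(", "int", ")", ";"], 2)` gives "parameter-list".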

First: the following productions force a shift when faced with a reduce-shift conflict. That is, when any lexical symbol belongs to the first predetermined type and conflicts between reduction processing and shift processing, shift processing is performed on it. The cases in which a lexical symbol belongs to the first predetermined type are as follows:

(1) selection-statement → if ( condition ) statement

When "else" is read in, reduction is not performed and a shift is forced; that is, each "else" is paired with the nearest "if".

(2) class-specifier → class-head

When a member class of a class is being defined and ":" is read in, reduction is not performed and a shift is forced.

The conflict arises because ":" may be the start of a base class list or the start of a bit field. Semantically, a class-head should not be qualified by a bit field, so a shift can be forced directly.

(3) exception-specification → noexcept

When "(" is read in, a shift is forced instead of a reduction, because "(" may be part of the noexcept specification or part of an initialization list.

For example: int (*pf) (int) noexcept (0);

Here "(0)" can be combined with noexcept, specifying that pf points to a function that may throw any exception, or it can be taken as the initial value of pf. This is grammatically ambiguous, so combination with noexcept is forced.

(4) enum-base → type-name and enum-base → type-modifier-seq type-name

When "const" or "volatile" is read in, a shift is forced instead of a reduction. The conflict arises because an enumeration variable can be defined as follows:

enum E : int const e = xxx;

Here const is ambiguous and has two interpretations:

1. "enum E : int" is reduced to enum-specifier, and "const" is reduced to type-specifier-seq.

2. "int const" is reduced to enum-base, and the whole of "enum E : int const" is reduced to enum-specifier.

Based on the standard, reduction according to the second interpretation is forced.

(5) new-type-id → trailing-type-distributor and new-declarator → new-ptr-operator

When "*" is read in, a shift is forced instead of a reduction. The conflict arises because a new-expression can serve as a basic expression, so "*" can be interpreted either as a pointer or as the multiplication operator.

According to the C++ standard, multiplying a new-expression has no meaning, so a shift is forced.

(6) conversion-type-id → trailing-type-distributor and conversion-declarator → ptr-operator

When "*", "&" or "&&" is read in, a shift is forced instead of a reduction. The conflict arises because a conversion-type-id can be part of a basic expression, so "*" and "&" can each be either a ptr-operator or an ordinary operator; these symbols are forced to be ptr-operator.
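The forced-shift rule for the first predetermined type amounts to a small lookup keyed by the production that could be reduced and the lexical symbol just read in. The sketch below encodes only cases (1) to (3); the dictionary layout and the name `resolve_shift_reduce` are assumptions made for illustration, and the remaining cases would follow the same pattern.

```python
# Production that could be reduced -> lookahead symbols that force a shift.
FORCED_SHIFT = {
    "selection-statement -> if ( condition ) statement": {"else"},
    "class-specifier -> class-head": {":"},
    "exception-specification -> noexcept": {"("},
}


def resolve_shift_reduce(reduce_production, lookahead):
    """Return "shift" when the first-type rule applies, otherwise None.

    None means that this rule does not decide the conflict and another
    strategy (context judgment or retry) has to be used.
    """
    if lookahead in FORCED_SHIFT.get(reduce_production, ()):
        return "shift"
    return None
```

Forcing the shift on "else" is what pairs each "else" with the nearest "if": the inner if-statement is not reduced while an "else" that could still belong to it is pending.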

Second: when the following productions face a reduce-shift conflict, whether to reduce or to shift can be judged from the symbol stack and the linear relation between symbols; if no judgment can be made, retry mode can be started. That is, when any lexical symbol belongs to the second predetermined type and conflicts between shift processing and reduction processing, whether to perform shift processing or reduction processing on it is determined according to the linear relation between the lexical symbols. The cases in which a lexical symbol belongs to the second predetermined type are as follows:

(1) ptr-declarator → noptr-declarator

When "(" is read in, it must be judged whether "(" is the start of a parameter list or the start of an initialization list. If it is the start of a parameter list, shift; if it is the start of an initialization list, reduce noptr-declarator to ptr-declarator.

(2) type-name → nested-class-name

When "(" is read in, it must be judged whether "(" is the start of a parameter list or the start of a declarator. If it is the start of a parameter list, shift.

For example (let A be a class name):

A (a); // object definition; "(a)" is a declarator

A::A (int); // "(int)" is a parameter list

(3) simple-type-specifier → type-name

If an object is being defined, type-name is reduced to simple-type-specifier.

For example:

(4) enum-specifier → enum-head

When a member enumeration of a class is being defined and ":" is read in, this production faces a reduce-shift conflict, and it must be judged whether ":" is followed by a bit field definition or an enumeration base definition. If ":" is not followed by a type name, enum-head is reduced to enum-specifier; otherwise, shift.

(5) ptr-abstract-declarator → ptr-operator

When a trailing-return-type is being defined and "(" is read in, this production faces a reduce-shift conflict, and it must be judged whether "(" is part of the trailing-type-id or the start of an initialization list.

(6) ptr-abstract-declarator → noptr-abstract-declarator

When a trailing-return-type is being defined and "(" is read in, this production faces a reduce-shift conflict, and it must be judged whether "(" is part of the trailing-type-id or the start of an initialization list.

(7) trailing-type-id → trailing-type-specifier

When a trailing-return-type is being defined and "(" is read in, this production faces a reduce-shift conflict, and it must be judged whether "(" is part of the trailing-type-id or the start of an initialization list.

(8) unary-operator → ~

When TYPE-NAME or decltype is read in, it must be judged whether "~" is the negation operator or a destructor identifier. If it is the negation operator, the "~" at the top of the stack is reduced to unary-operator. If it is a destructor identifier, shift.

For example:

Third: reduce-reduce conflicts. When the following productions face a reduce-reduce conflict, the production to reduce by can be judged from the symbol stack and the linear relation between symbols. That is, when any lexical symbol belongs to the third predetermined type and conflicts between the first reduction processing and the second reduction processing, whether to perform the first reduction processing or the second reduction processing on it is determined according to the linear relation between the lexical symbols. The cases in which a lexical symbol belongs to the third predetermined type are as follows:

(1) braced-init-list → { } and compound-statement → { }

When ";" is read in, it must be judged whether the top of the symbol stack is a function. If so, reduce according to compound-statement → { }; otherwise, reduce according to braced-init-list → { }.

(2) initializer-clause → assignment-expression and expression → assignment-expression

When the next symbol is read in, it must be judged whether the top of the symbol stack is a function. If so, reduce according to expression → assignment-expression; otherwise, reduce according to initializer-clause → assignment-expression.
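The reduce-reduce rule for the third predetermined type can be sketched as a check on the top of the symbol stack. The following is an illustrative assumption rather than the application's implementation: the stack is modelled as a list of strings, the marker "function" and the predicate `is_function` are hypothetical, and the non-terminal names follow the C++ standard (braced-init-list, compound-statement, assignment-expression).

```python
# Handle being reduced -> (target when the stack top is a function, target otherwise).
REDUCE_REDUCE = {
    "{ }": ("compound-statement", "braced-init-list"),
    "assignment-expression": ("expression", "initializer-clause"),
}


def resolve_reduce_reduce(handle, symbol_stack,
                          is_function=lambda symbol: symbol == "function"):
    """Pick the production to reduce by, based on the symbol below the handle.

    symbol_stack holds the grammar symbols beneath the handle; its last
    element is the stack top that the rule inspects.
    """
    in_function, otherwise = REDUCE_REDUCE[handle]
    top = symbol_stack[-1] if symbol_stack else None
    if top is not None and is_function(top):
        return in_function
    return otherwise
```

For example, `resolve_reduce_reduce("{ }", ["function"])` selects compound-statement, while the same braces below any other stack top are reduced to braced-init-list.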

In addition, when it cannot be judged whether a shift or a reduction should currently be performed, the current state of the model can be saved as a "snapshot" and one of the actions executed tentatively. If an error occurs, the model is restored from the previously saved snapshot and the other action is executed; this is the "retry mechanism". Each snapshot corresponds to a "retry end symbol": if no error occurs between saving the snapshot and reading in the retry end symbol, the snapshot can be cancelled and execution continues.

For example in the following code: int fun (int () (int (x)));

if the first "(" it cannot be judged whether the parameter list is later, the model snapshot needs to be saved, the retry ending character corresponding to the snapshot is the last ")" is read in, and then is executed according to the parameter list tentatively, if no error occurs from the snapshot saving to the retry ending character reading, the parameter list is considered to be in brackets, and the previously saved snapshot can be cancelled. Fig. 2 shows a schematic process diagram of the collision resolution process.

In Fig. 2, judge-CSR denotes selecting the action to perform for the current conflict; save-snapshot denotes storing the symbol stack, state stack, scope stack, current input symbol, etc. in a snapshot; snapshot-stack denotes that snapshots are organized as a stack, i.e. a retry can also occur during a retry; roll-back denotes restoring the model to a previous state; set-trial-end(x) denotes setting the symbol x as a retry end symbol; cancel-trial-end(x) denotes setting the symbol x as a non-retry-end symbol; and roll-back-flag indicates whether a roll-back has already been performed (true if it has, false if it has not). Note that in Fig. 2, token is a lexical symbol, m and x are the entry components obtained by looking up the first lookup table (Action table), R denotes reduce, S denotes shift, x = x(0) denotes the first reduction or the second reduction, and x = x(1) denotes the second reduction or the first reduction.
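The snapshot and retry mechanism can be illustrated with the following minimal sketch. It is not the flow of Fig. 2 itself: the `Model` class, `run_with_retry`, and the `toy_step` function (which simply fails when the action is "shift" and the token is "bad") are hypothetical names introduced for the illustration, and the retry end symbol is identified by its position in the token list.

```python
import copy


class ParseError(Exception):
    pass


class Model:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.symbol_stack = []
        self.state_stack = [0]
        self.scope_stack = ["global"]
        self.snapshot_stack = []  # snapshots form a stack, so retries can nest

    def save_snapshot(self, alternative, trial_end):
        state = copy.deepcopy(
            (self.pos, self.symbol_stack, self.state_stack, self.scope_stack))
        self.snapshot_stack.append((state, alternative, trial_end))

    def roll_back(self):
        state, alternative, _ = self.snapshot_stack.pop()
        self.pos, self.symbol_stack, self.state_stack, self.scope_stack = state
        return alternative

    def cancel_snapshot(self):
        self.snapshot_stack.pop()


def toy_step(model, action):
    # Hypothetical stand-in for one parser step; consumes one token.
    tok = model.tokens[model.pos]
    if action == "shift" and tok == "bad":
        raise ParseError(tok)
    model.symbol_stack.append((action, tok))
    model.pos += 1


def run_with_retry(model, actions, trial_end, step=toy_step):
    """Try actions[0] up to the retry end symbol; on error retry with actions[1]."""
    model.save_snapshot(alternative=actions[1], trial_end=trial_end)
    action, rolled_back = actions[0], False  # rolled_back plays the role of roll-back-flag
    while True:
        try:
            while model.pos <= trial_end:
                step(model, action)
            if not rolled_back:
                model.cancel_snapshot()  # no error before the retry end symbol
            return action
        except ParseError:
            if rolled_back:
                raise  # both actions failed
            action, rolled_back = model.roll_back(), True
```

Calling `run_with_retry(Model(["(", "bad", ")"]), ("shift", "reduce"), trial_end=2)` first fails under "shift", restores the snapshot, and completes under "reduce".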

The embodiment of the present application provides another possible implementation manner, where:

before the first lookup table and the second lookup table are searched according to the part of speech of any lexical symbol to obtain a corresponding search result, the grammar of the predetermined programming language may be analyzed in advance to generate the first lookup table and the second lookup table, wherein generating the first lookup table and the second lookup table by analyzing the grammar of the predetermined programming language comprises the following steps:

determining the category and the sequence corresponding to each lexical symbol in a preset programming language;

dividing the grammar symbols into terminals and non-terminals, wherein the grammar symbols comprise the lexical symbols and the categories corresponding to the lexical symbols, the lexical symbols are terminals, the categories are non-terminals, and the non-terminals represent the hierarchical structure of the predetermined programming language;

determining, according to the order of the lexical symbols, the non-terminal to which each lexical symbol belongs;

and generating the first lookup table according to the pre-generated states of the predetermined programming language and the terminals, and generating the second lookup table according to the pre-generated states of the predetermined programming language and the non-terminals.

Wherein the grammar comprises a start symbol and a production list; the start symbol is a predefined non-terminal, and each production in the production list characterizes a relationship between terminals and non-terminals.

Each production includes a left part and a right part, wherein the left part is the non-terminal to which the right part belongs, and the right part is a sequence of terminals and/or non-terminals.

The non-terminal in the left part corresponds to a node in the syntax tree, and each terminal or non-terminal in the right part is a child node of that node.

In particular, a grammar is a precise description of the structure of the predetermined programming language; it specifies the categories and order of the various code elements (i.e., lexical symbols, tokens). For example, the grammar specifies that a code file of the C++ programming language is composed of categories such as class declarations, function definitions and namespace declarations, where a namespace declaration may in turn be composed of categories such as class declarations, function definitions and namespace declarations. Such categories and code elements (i.e., tokens) are collectively referred to as "grammar symbols (symbols)".

Furthermore, grammar symbols can be divided into "non-terminals" and "terminals". The non-terminals give the hierarchical structure of the programming language; this hierarchy is the key to syntax parsing and is the target of the embodiment of the present application. Intuitively, a category corresponds to a non-terminal, and each lexical symbol (token) that directly makes up the code is a terminal. A lexical symbol (token) can be converted directly into a grammar symbol (symbol) as a terminal, whereas a non-terminal corresponds to several lexical symbols (tokens) or non-terminals after they have been classified.

Take the code "struct A { int a; };" as an example. This code is composed of the lexical symbols (tokens) "struct", "A", "{", "int", "a", ";", "}" and ";", i.e. terminals, and the grammar specifies that terminals in this order can be classified as the non-terminal "class-specifier" (i.e. a class declaration). In the terminology of the compiler field, "classify" is also called "reduce"; "classify" and "reduce" have the same meaning hereinafter.
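As a small illustration of how that code breaks down into terminals, the sketch below splits the text with a regular expression. The `tokenize` helper is hypothetical and far simpler than a real lexical analysis module; it only separates identifiers, numbers and single punctuation characters.

```python
import re


def tokenize(code):
    """Split code into identifier, number and single-character tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


tokens = tokenize("struct A { int a; };")
```

Here `tokens` holds the eight lexical symbols listed in the paragraph above, in source order.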

The general attributes of the grammar symbols are shown in table 2 below, where the left side in table 2 is the english name of each attribute, and the right side is the chinese description or explanation corresponding to each english name.

TABLE 2 general Properties of grammatical notation (symbol)

In addition, the specific symbols have specific attributes that record the information needed by the model at runtime.

Further, the grammar consists of a start symbol and a production list, as follows:

1. The start symbol is a non-terminal that corresponds to the top-level description of the whole programming language. The start symbol in the embodiment of the present application is "translation-unit", and it corresponds to the root of the syntax tree.

2. The production list (production-list) describes how terminals and non-terminals are composed. Each production consists of a "left part", which is a non-terminal, and a "right part", which is a sequence of terminals or non-terminals. Any production in the production list can be formally represented as "left part → right part".

For example: selection-statement → if ( condition ) statement. Here "selection-statement" is the non-terminal on the left, and the right part is the sequence consisting of the keyword "if", the terminal "(", the non-terminal "condition", the terminal ")" and the non-terminal "statement"; the production states that the sequence "if ( condition ) statement" on the right can be classified as the non-terminal "selection-statement" on the left.

In the syntax tree, the non-terminal in the left part of a production corresponds to a node, and the terminals and non-terminals in the right part are its child nodes. The basic attributes of a production are shown in table 3 below, where the left side of table 3 is the English name of each attribute and the right side is the corresponding description or explanation.

TABLE 3 - Basic attributes of a production

left	Left part
right	Right part
length()	Number of symbols in the right part
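The attributes in Table 3, and the way a production maps onto a syntax tree node, might be modelled as in the sketch below. The `Node` class and the `reduce_to_node` helper are assumptions made for the illustration; only `left`, `right` and `length()` come from the table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Production:
    left: str      # non-terminal on the left
    right: tuple   # sequence of terminals and/or non-terminals

    def length(self):
        return len(self.right)


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)


def reduce_to_node(production, children):
    """Build the node for the left part with the right-part symbols as children."""
    if [child.symbol for child in children] != list(production.right):
        raise ValueError("children do not match the right part")
    return Node(production.left, list(children))


selection = Production("selection-statement",
                       ("if", "(", "condition", ")", "statement"))
tree = reduce_to_node(selection, [Node(s) for s in selection.right])
```

The resulting `tree` has the non-terminal selection-statement as its node and the five right-part symbols as child nodes.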

Further, the first lookup table generated by analyzing the grammar of the predetermined programming language may be an Action table, and the second lookup table may be a Goto table. The Action table is a two-dimensional mapping table of "state, terminal part of speech", that is, one state and one terminal part of speech correspond to one entry. An Action entry can be formally written as Action(state, cat), where state is a state and cat is a part of speech. The value of Action(state, cat) can be expressed as a pair (m, x), where m takes one of the following five values:

1. Shift (S): the new state is pushed onto the state stack, and the lexical symbol (token) currently read in is converted into a grammar symbol (symbol) and pushed onto the symbol stack. When m is S (shift), the value of x is the new state.

2. Reduce (R): a contiguous sequence of symbols of a certain length at the top of the symbol stack, which forms the right part of a certain production, can be reduced according to that production. Specifically, the reduction pops the symbol sequence from the symbol stack, pops a state sequence of the same length from the state stack, and pushes the left part of the production onto the symbol stack; it then looks up the Goto table with the current top state of the state stack and the left part just pushed, and pushes the new state found onto the state stack. When m is R (reduce), x is the sequence number in the production list of the production to reduce by, which may be written production-list(x).

3. Shift-reduce conflict (CSR): either a shift or a reduce is currently possible. When m is CSR, x is a pair whose first component is the state to shift to and whose second component is the sequence number of the production to reduce by.

4. Reduce-reduce conflict (CRR): a reduction by different productions is currently possible. When m is CRR, x is a pair whose two components are the sequence numbers of the two productions that may be reduced by.

5. Empty (NULL): a table lookup error, i.e., the model does not accept the current lexical symbol (token). When m is NULL, x has no meaning and is by convention also NULL.
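One possible in-memory shape for Action entries is sketched below. The state numbers, parts of speech and production numbers in `ACTION` are invented purely for illustration; only the five kinds and the meaning of x for each kind follow the list above.

```python
from enum import Enum


class Kind(Enum):
    S = "shift"
    R = "reduce"
    CSR = "shift-reduce conflict"
    CRR = "reduce-reduce conflict"
    NULL = "error"


# Hypothetical entries: (state, cat) -> (m, x)
ACTION = {
    (0, "TYPE-NAME"): (Kind.S, 3),
    (3, "IDENTIFIER"): (Kind.R, 7),
    (5, "("): (Kind.CSR, (8, 12)),
    (6, "}"): (Kind.CRR, (4, 9)),
}


def action(state, cat):
    """Missing entries are table errors, returned as (NULL, None)."""
    return ACTION.get((state, cat), (Kind.NULL, None))


def describe(state, cat):
    m, x = action(state, cat)
    if m is Kind.S:
        return f"shift to state {x}"
    if m is Kind.R:
        return f"reduce by production-list({x})"
    if m is Kind.CSR:
        return f"shift to state {x[0]} or reduce by production-list({x[1]})"
    if m is Kind.CRR:
        return f"reduce by production-list({x[0]}) or production-list({x[1]})"
    return "error: token not accepted"
```

Storing only the non-empty entries in a dictionary keeps the table sparse; a lookup that misses is treated as the NULL case.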

The Action table is obtained from the production list file of the grammar by the LR(1) algorithm. For convenience of discussion, in the embodiment of the present application Action(state, cat) denotes an Action entry, and Action(state), without the cat parameter, denotes the set of all parts of speech that correspond to the state in the Action table.

Further, the Goto table is a two-dimensional mapping table of "state, non-terminal name", that is, one state and one non-terminal correspond to one entry, and each entry holds a state. An entry in the Goto table may be written Goto(state, symbol-name), where state is a state and symbol-name is a non-terminal name. The Goto table is used when reducing by a production, and gives the state the parser should be in after the reduction. The non-terminals that correspond to the current top state of the state stack in the Goto table indicate the various reductions that may come next.

The Goto table is likewise obtained from the production list file of the grammar by the LR(1) algorithm. For convenience of discussion, in the embodiment of the present application Goto(state, symbol-name) denotes a Goto entry, and Goto(state), without the symbol-name parameter, denotes all non-terminals that correspond to the state in the Goto table.
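How the Action and Goto tables drive shift and reduce steps can be seen on a toy grammar. The sketch below is an assumption-laden illustration: the grammar S → ( S ) | x, its hand-built tables, the end marker "$" and the extra "ACC" (accept) entry are not from the source, which builds its tables for the full language with the LR(1) algorithm.

```python
# Toy grammar: production 0 is S -> ( S ), production 1 is S -> x.
PRODUCTIONS = [("S", ("(", "S", ")")), ("S", ("x",))]

ACTION = {
    (0, "("): ("S", 2), (0, "x"): ("S", 3),
    (2, "("): ("S", 2), (2, "x"): ("S", 3),
    (3, ")"): ("R", 1), (3, "$"): ("R", 1),
    (4, ")"): ("S", 5),
    (5, ")"): ("R", 0), (5, "$"): ("R", 0),
    (1, "$"): ("ACC", None),  # accept entry: an assumption of this sketch
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens):
    """Return the syntax tree as nested (non-terminal, children) tuples."""
    symbol_stack, state_stack = [], [0]
    tokens = list(tokens) + ["$"]
    i = 0
    while True:
        m, x = ACTION.get((state_stack[-1], tokens[i]), (None, None))
        if m == "S":                      # shift: push symbol and new state
            symbol_stack.append(tokens[i])
            state_stack.append(x)
            i += 1
        elif m == "R":                    # reduce by production x
            left, right = PRODUCTIONS[x]
            cut = len(symbol_stack) - len(right)
            children = symbol_stack[cut:]
            del symbol_stack[cut:]
            del state_stack[len(state_stack) - len(right):]
            symbol_stack.append((left, children))
            state_stack.append(GOTO[(state_stack[-1], left)])
        elif m == "ACC":
            return symbol_stack[-1]
        else:                             # empty entry: token not accepted
            raise SyntaxError(f"unexpected token {tokens[i]!r}")
```

For the input `["(", "x", ")"]` the parser shifts "(" and "x", reduces "x" to S, shifts ")", and reduces the three symbols to the root S.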

Further, a symbol stack (symbol-stack) and a state stack (state-stack), both structured as stacks, store the symbol relationships and state relationships while the model runs. The basic attributes of the symbol stack and the state stack are shown in table 4 below, where the left side of table 4 is the English name of each attribute and the right side is the corresponding description or explanation.

TABLE 4 - Basic attributes of the symbol stack and the state stack

top()	Stack top element
pop()	Pop elements; the number is specified by parameter n
push()	Push the element specified by parameter x
size()	Number of elements in the stack
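A minimal sketch of a stack offering the four operations of Table 4 follows. The class name `Stack`, the default `n=1` and the choice to return the popped elements in bottom-to-top order are assumptions of this illustration.

```python
class Stack:
    def __init__(self):
        self._items = []

    def top(self):
        return self._items[-1]

    def pop(self, n=1):
        """Pop n elements and return them in bottom-to-top order."""
        if n < 0 or n > len(self._items):
            raise IndexError("cannot pop that many elements")
        cut = len(self._items) - n
        popped = self._items[cut:]
        del self._items[cut:]
        return popped

    def push(self, x):
        self._items.append(x)

    def size(self):
        return len(self._items)
```

Popping several elements at once matches the reduction step, where a symbol sequence and a state sequence of the same length are removed together.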

Further, analyzing each lexical symbol in the linear linked list in sequence, including:

when the current lexical symbol in the linear linked list is the lexical symbol representing the beginning of the scope, creating a corresponding stack top scope at the stack top of the scope stack, and writing the lexical symbol after the lexical symbol representing the beginning of the scope into the stack top scope; and when the current lexical symbol in the linear linked list is the lexical symbol representing the end of the scope, performing stack-top scope popping.

The scope at the bottom of the scope stack is the global scope, and all scopes other than the bottom one are local scopes.

Specifically, if the lexical symbol read from the linear linked list is a lexical symbol indicating the beginning of a scope (e.g., "{"), a corresponding stack top scope is created at the top of the scope stack, and the lexical symbols (e.g., variables, keywords, etc.) that follow the scope-start symbol are written into the created stack top scope. If this is not the first time a scope-start symbol (e.g., "{") is encountered while analyzing the lexical symbols in the linear linked list, a new scope, i.e., a stack top scope, is again created at the top of the scope stack; this stack top scope is a local scope, and the lexical symbols that follow the scope-start symbol are then written into it.

Further, if the lexical symbol read from the linear linked list is a lexical symbol indicating the end of a scope (e.g., "}"), the stack top scope is popped, that is, the lexical symbols written into the stack top scope are read out and the created stack top scope is deleted.
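The push and pop behaviour of the scope stack can be sketched as follows. The `ScopeStack` class is hypothetical, each scope is reduced to a plain list of the tokens written into it, and only "{" and "}" are treated as scope delimiters.

```python
class ScopeStack:
    def __init__(self):
        self.scopes = [[]]  # the bottom entry is the global scope

    def feed(self, token):
        """Process one token; return the popped scope when a scope ends."""
        if token == "{":
            self.scopes.append([])       # create a new stack top (local) scope
            return None
        if token == "}":
            if len(self.scopes) == 1:
                raise SyntaxError("unmatched '}'")
            return self.scopes.pop()     # read out and delete the stack top scope
        self.scopes[-1].append(token)    # write the token into the stack top scope
        return None


stack = ScopeStack()
popped = [stack.feed(t) for t in "int g ; { int a ; { int b ; } }".split()]
```

After the run, the two local scopes have been popped in inner-to-outer order and only the global scope remains on the stack.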

Further, in the C++ programming language, namespaces, types and objects are organized semantically by "scope", and scopes may be nested. Namespaces, types, function bodies, etc. each have their own scope; identifiers in different scopes may have the same name, and a type name and an object name in the same scope may also have the same name. The embodiment of the application analyzes scopes with a stack, and the scope stack is an important reference for dynamic part-of-speech determination.

Further, the scope can be divided into a namespace scope, a class scope, a function body scope, etc., different scopes have their specific attributes, and the general attributes of the scope (scope) are shown in table 5 below, in table 5, the left side is the english name of each attribute, and the right side is the chinese description or explanation corresponding to each english name.

TABLE 5-generic Properties of scopes

In practical applications, the first lookup table and the second lookup table of the predetermined programming language may be generated in advance by analyzing the grammar of the predetermined programming language. Subsequently, when a code file to be analyzed in the predetermined programming language is received, each lexical symbol in the linear linked list corresponding to that code file can be analyzed in sequence based on the pre-generated first lookup table and second lookup table, so that a syntax tree of the code file is generated and the code file can be statically analyzed through the syntax tree.

Each lexical symbol in the linear linked list has a corresponding part of speech, and analyzing the lexical symbols in sequence based on the first lookup table and the second lookup table in fact means searching the two tables according to the part of speech of each lexical symbol. The first lookup table determines, from the part of speech of a lexical symbol, the operation or action to perform after the symbol is read in, and the second lookup table is consulted in a nested manner from the first. Therefore, before the two tables are searched, the part of speech of the lexical symbol that has been read in must be determined, so that the first lookup table and the second lookup table can be searched according to it and a corresponding search result obtained.

For example, if the lexical symbol currently read in is "int", part-of-speech determination finds its part of speech to be "type name", and the first lookup table and the second lookup table can then be searched according to the part of speech "type name" of the lexical symbol "int" to obtain the corresponding search result.

Specifically, determining the part-of-speech of any lexical symbol in the linear linked list includes:

performing part-of-speech determination on any lexical symbol according to the context to determine the part of speech of that lexical symbol; and/or

performing part-of-speech guessing on any lexical symbol according to the context to determine the part of speech of that lexical symbol.

Specifically, performing part-of-speech determination on any lexical symbol according to the context to determine its part of speech includes any of the following situations:

when the lexical symbol is any one of a type name, an object name and a keyword, taking that category as the part of speech of the lexical symbol;

when the lexical symbol is a lexical symbol of a predetermined type, judging whether the lexical symbol is a template list symbol or an operator, and taking the judgment result as the part of speech of the lexical symbol;

and searching for information in the scope, and taking the search result as the part of speech of the lexical symbol.

Specifically, searching for information in the scope and taking the search result as the part of speech of any lexical symbol includes:

performing a horizontal search in the scope, and taking the result of the horizontal search as the part of speech of the lexical symbol;

wherein the horizontal search is a search in the current scope and the namespaces it references, or in the base class scopes of the current scope.

Specifically, the method further comprises the following steps:

if the horizontal search finds no result, performing a vertical search in the scope, and taking the result of the vertical search as the part of speech of the lexical symbol;

wherein the vertical search is a search in the scopes that contain the current scope.

Specifically, performing part-of-speech guessing on any lexical symbol according to the context to determine its part of speech includes:

guessing, according to the context, whether the part of speech of the lexical symbol is a namespace name;

if it is guessed not to be a namespace name, and the part of speech of the lexical symbol is guessed to be a type name and not an object name, then: if the first lexical symbol after the lexical symbol is not a lexical symbol of the predetermined type, determining that the part of speech of the lexical symbol is a type name; and if the first lexical symbol after the lexical symbol is a lexical symbol of the predetermined type, determining the part of speech of the lexical symbol by performing a first predetermined process on the lexical symbol.

Specifically, the method further comprises:

if it is guessed not to be a namespace name, and the part of speech of the lexical symbol is guessed to be an object name and not a type name, determining that the part of speech of the lexical symbol is an object name.

Specifically, the method further comprises:

if it is guessed not to be a namespace name, and the part of speech of the lexical symbol is guessed to be possibly both an object name and a type name, determining from the context of the lexical symbol whether it is a type name; if it is a type name, determining that the part of speech of the lexical symbol is a type name; and if it is not, determining the part of speech of the lexical symbol by performing a second predetermined process on the lexical symbol.

The following is a detailed description of specific contents related to the present embodiment:

In the grammar according to the embodiment of the present application, type names (for example, class names, structure names, union names, enumeration names, and the like) and object names (for example, variable names, function names, object names of custom classes, and the like) both appear in the grammar as terminals, so when an identifier is read in it is necessary to determine whether it is a type name (TYPE-NAME) or an object name (IDENTIFIER). Information about the current lexical symbol can be looked up from the lexical symbols read in earlier and the syntax tree built so far, and from this it is judged whether the current lexical symbol is a type name or an object name.

For example the following code:

Keywords introduced after C++98, such as final and override, may also be used as object names, so during parsing it is necessary to determine from the context whether they are being used as keywords. In other words, when any lexical symbol is one of a type name, an object name and a keyword, that category is taken as its part of speech: if the lexical symbol is a type name, its part of speech is type name; if it is an object name, its part of speech is object name; and if it is a keyword, its part of speech is keyword.

Another example is the following code:

For lexical symbols of the predetermined type such as "<" and ">", it is necessary to judge whether they are the beginning or end of a template list or are less-than or greater-than signs. For the predetermined-type lexical symbol ">>", it is necessary to judge whether it is a shift operator or two template list end symbols written together; if it is two end symbols written together, the ">>" needs to be replaced by two ">" in the linear linked list (token-list). In other words, when any lexical symbol is a lexical symbol of the predetermined type, it is judged whether the lexical symbol is a template list symbol or an operator, and the judgment result is taken as the part of speech of the lexical symbol.
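Replacing a fused ">>" by two ">" in the token list can be sketched as below. This is a deliberate simplification: the function `split_template_ends` is hypothetical, it is told through `template_starts` which "<" tokens have already been judged to open a template list, and it treats ">>" as two end symbols only when at least two template lists are open.

```python
def split_template_ends(tokens, template_starts):
    """Return a new token list with template-closing '>>' split into '>', '>'.

    template_starts: indices of '<' tokens already judged to open a template list.
    """
    out, depth = [], 0
    for i, tok in enumerate(tokens):
        if tok == "<" and i in template_starts:
            depth += 1
            out.append(tok)
        elif tok == ">" and depth > 0:
            depth -= 1
            out.append(tok)
        elif tok == ">>" and depth >= 2:
            depth -= 2
            out.extend([">", ">"])   # two template list end symbols
        else:
            out.append(tok)          # operator or any other token
    return out
```

A ">>" met while no template list is open is left untouched and keeps its meaning as a shift operator.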

All special terminals that require part-of-speech determination are listed in table 6 below, where the left side of table 6 is the English name of each special terminal and the right side is the corresponding description or explanation.

TABLE 6 - All special terminals that require part-of-speech determination

In addition, because the syntax tree generated in the embodiment of the present application is used for static analysis of the code file, some incomplete code is also acceptable. When the part of speech of a lexical symbol cannot be determined, it is guessed using a part-of-speech guessing mechanism.

In other words, the overall process of determining the part of speech of any lexical symbol in the linear linked list may include the following three sub-processes: (1) determining the part of speech of keywords and operators; (2) determining the part of speech by searching scope information; (3) determining the part of speech by part-of-speech guessing. Specific examples of (1) and (2) are described below, and (3) is described afterwards as another implementation.

For (1) and (2) above, take the following code as an example:

struct A
{};
int A = 0;
int fun()
{
    struct A a; // #1
    return A; // #2
}

In this example, when "A" at #1 is analyzed, a scope information lookup finds that "A" is both a class name and a variable name, and the keyword "struct" at #1 restricts the model to accept only a class name, so the part of speech of "A" at #1 is TYPE-NAME (type name). When "A" at #2 is analyzed, both a class name and a variable name are currently acceptable and the variable name takes precedence, so the part of speech of "A" at #2 is determined to be IDENTIFIER (variable name). When part-of-speech information cannot be obtained through keyword and operator part-of-speech determination or scope information lookup, the part-of-speech guessing mechanism is started, i.e. the part of speech is determined by guessing. Fig. 3 is a schematic diagram of the process for determining parts of speech in the above code example.
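The decision made at #1 and #2 can be written down as a short function, offered here only as a sketch. The function `determine`, the `found` set (the parts of speech returned by the scope lookup) and the use of the previous token to detect the "struct" restriction are assumptions of this illustration; the actual model takes the restriction from its parser state.

```python
TYPE_ONLY_KEYWORDS = {"struct"}  # the example only mentions "struct"


def determine(found, previous_token):
    """Return the part of speech, or None when guessing has to take over."""
    if not found:
        return None                      # nothing found: start the guessing mechanism
    if previous_token in TYPE_ONLY_KEYWORDS:
        return "TYPE-NAME" if "TYPE-NAME" in found else None
    if "IDENTIFIER" in found:
        return "IDENTIFIER"              # the variable name takes precedence
    return "TYPE-NAME"


both = {"TYPE-NAME", "IDENTIFIER"}
at_1 = determine(both, "struct")   # "A" in: struct A a;
at_2 = determine(both, "return")   # "A" in: return A;
```

Under these assumptions `at_1` is the type name and `at_2` is the variable name, matching the outcome described for #1 and #2.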

In addition, special handling is required for member functions defined inside a class declaration, because a member function may be implemented before the class members it uses are declared, as in the following code:

Here the declaration of the member variable m comes later than the definition of the member function fun. This is legal, but if m in the function body of fun were analyzed directly, its part-of-speech information could not be obtained. Therefore, when the "{" after "fun()" is read in, the function body is moved outside the class declaration and replaced by ";", and the function declaration information after the move is stored in the terminal "TRIMED-DECL-INFO". Fig. 4 is a schematic diagram of the process of obtaining the parts of speech of keywords and operators.

In addition, scope information search is divided into "horizontal search" and "vertical search". "Horizontal search" means searching in the current scope and in the namespaces it references or the base class scopes of the current scope; "vertical search" means searching in the scopes that contain the current scope when the horizontal search finds no result. Fig. 5 is a schematic diagram of the process of determining the part of speech by searching scope information.
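The two search directions can be sketched as follows. The `Scope` class and its fields (`names`, `parent`, `bases`, `used`) are hypothetical, and this sketch looks only one level into referenced namespaces and base class scopes, which the text does not specify either way.

```python
class Scope:
    def __init__(self, parent=None, bases=(), used=()):
        self.names = {}            # name -> part of speech
        self.parent = parent       # enclosing scope, None for the global scope
        self.bases = list(bases)   # base class scopes
        self.used = list(used)     # referenced namespaces


def find_horizontal(scope, name):
    """Search the current scope, then referenced namespaces and base classes."""
    for candidate in [scope] + scope.used + scope.bases:
        if name in candidate.names:
            return candidate.names[name]
    return None


def find(scope, name):
    """Horizontal search first; on failure continue in the enclosing scopes."""
    while scope is not None:
        found = find_horizontal(scope, name)
        if found is not None:
            return found
        scope = scope.parent       # vertical step
    return None


global_scope = Scope()
global_scope.names["A"] = "TYPE-NAME"
base = Scope(parent=global_scope)
base.names["m"] = "IDENTIFIER"
derived = Scope(parent=global_scope, bases=[base])
body = Scope(parent=derived)
```

Looking up "m" from `body` fails horizontally, moves up to `derived`, and succeeds there through the base class scope.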

The process of determining the part of speech of any lexical symbol by part-of-speech guessing according to the context comprises the following key sub-processes: (1) like-simple-type-name: judges from the context of the symbol whether its part of speech can be TYPE-NAME (type name), without considering the case where the lexical symbol (token) is followed by "<". (2) match-parameter-type-name: when the lexical symbol (token) is followed by "<", judges whether that "<" is the start of a template parameter list. (3) find-template-list-end: if "<" is the start of a template parameter list, returns the corresponding ">". (4) handle-stick-template-list-end: handles template parameter list end symbols ">>" that are written together.

Fig. 6 shows a schematic diagram of the process of determining the part of speech of a lexical symbol by part-of-speech guessing. In Fig. 6, x in the function accept(x) represents a part of speech (such as type name or object name), and accept(x) indicates whether a lexical symbol (token) whose part of speech is x can currently be accepted by the model; accept(x) is equivalent to checking whether the entry Action(state-stack.top(), x) is non-empty, where state-stack.top() is the top of the state stack. In addition, acc-tp and acc-id in Fig. 6 are variable names, the first predetermined process is the find-template-list-end process and handle-stick-template-list-end process described above, and the second predetermined process is the match-parameter-type-name process and handle-stick-template-list-end process described above, which are not described again here.

After the conflict elimination processing is performed on any lexical symbol, the method further comprises: performing error recovery processing on the processing result of the conflict elimination processing.

When the first table entry in the first lookup table that corresponds to any lexical symbol and the top state of the current state stack is empty, error recovery processing is performed on that lexical symbol, and the syntax tree of the code file to be analyzed is then generated according to the result of the error recovery processing.

Specifically, before performing error recovery processing on any lexical symbol, the method further includes:

and generating prompt information for reminding the user that any lexical symbol is wrong, and displaying the prompt information.

Specifically, the error recovery processing of any lexical symbol includes:

determining at least one group, in a predetermined combination form, of a target lexical symbol, a target state and a target non-terminal that satisfies a predetermined recovery condition;

and performing error recovery processing on the lexical symbol based on any such group of target lexical symbol, target state and target non-terminal.

Specifically, the predetermined recovery condition is that a third table entry, namely the entry in the first lookup table corresponding to the target lexical symbol and a second table entry, is non-empty;

wherein the second table entry is the entry in the second lookup table corresponding to the target state and the target non-terminal.

Specifically, determining at least one group, in a predetermined combination form, of a target lexical symbol, a target state and a target non-terminal that satisfies the predetermined recovery condition includes:

skipping a predetermined number of states downward from the top state of the current state stack to reach the target state;

skipping a predetermined number of lexical symbols from the lexical symbol to reach the target lexical symbol;

and determining the non-terminal corresponding to the target state as the target non-terminal, based on the correspondence between states and non-terminals in the second lookup table.

Specifically, performing error recovery processing on the lexical symbol based on any group of target lexical symbol, target state and target non-terminal includes:

in the state stack, popping the states above the target state;

in the symbol stack, popping symbols from the top, the number of popped symbols being the same as the number of states popped from the state stack;

and pushing the target non-terminal onto the symbol stack and the second table entry onto the state stack, thereby performing error recovery processing on the lexical symbol.

Specifically, the method further comprises the following steps:

when a plurality of groups of target lexical symbol, target state and target non-terminal satisfy the predetermined recovery condition, performing error recovery processing on the lexical symbol based on the first group found that satisfies the predetermined recovery condition; or performing error recovery processing on the lexical symbol based on the group with the highest priority among the plurality of groups.

The error recovery process described above is described in detail below:

When the entry in the first lookup table (i.e., the Action table) that corresponds to the lexical symbol (token) read in and the top state of the current state stack is empty, that is, when Action(state-stack.top(), token.cat) is (NULL, NULL), the lexical symbol (token) is not accepted by the current model and a syntax error exists; such a lexical symbol is called an error-token (erroneous lexical symbol).

In this case, a state s can be found by searching downward from the top of the current state stack, a lexical symbol t can be found by searching onward from the current position in the input symbol sequence, and a non-terminal A can be found in Goto(s) of the second lookup table. The triple (s, A, t) being sought must satisfy the condition Action(Goto(s, A), t) != NULL (i.e., the entry found in the first lookup table is not empty); this condition may be referred to as the "recovery condition". After an (s, A, t) meeting the recovery condition is found, all states above s are popped from the state stack, the same number of symbols are popped from the symbol stack, A is pushed onto the symbol stack, and Goto(s, A) is pushed onto the state stack. The error is thereby skipped, and this process of skipping the error is called error recovery.
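The search for (s, A, t) and the recovery step can be sketched on a toy example. Everything concrete here is an assumption of the illustration: the hand-built tables for the grammar S → ( S ) | x, the end marker "$", the "ACC" entry, and the candidate format (depth, offset, s, A, t), where depth counts states below the stack top and offset counts tokens after the error-token.

```python
ACTION = {
    (0, "("): ("S", 2), (0, "x"): ("S", 3),
    (2, "("): ("S", 2), (2, "x"): ("S", 3),
    (3, ")"): ("R", 1), (3, "$"): ("R", 1),
    (4, ")"): ("S", 5),
    (5, ")"): ("R", 0), (5, "$"): ("R", 0),
    (1, "$"): ("ACC", None),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def find_recoveries(state_stack, tokens, pos):
    """List every (depth, offset, s, A, t) with Action(Goto(s, A), t) non-empty."""
    results = []
    for depth in range(len(state_stack)):            # search down from the stack top
        s = state_stack[-1 - depth]
        for (state, nonterminal), target in GOTO.items():
            if state != s:
                continue
            for offset in range(len(tokens) - pos):  # search onward in the input
                t = tokens[pos + offset]
                if (target, t) in ACTION:            # the recovery condition
                    results.append((depth, offset, s, nonterminal, t))
    return results


def recover(symbol_stack, state_stack, pos, candidate):
    """Apply one candidate in place and return the position of t."""
    depth, offset, s, nonterminal, _ = candidate
    if depth:
        del state_stack[-depth:]     # pop all states above s
        del symbol_stack[-depth:]    # pop the same number of symbols
    symbol_stack.append(nonterminal)
    state_stack.append(GOTO[(s, nonterminal)])
    return pos + offset


tokens = ["(", "?", ")", "$"]        # "?" is the error-token at position 1
candidates = find_recoveries([0, 2], tokens, 1)
```

Taking `candidates[0]` corresponds to recovering with the first result found; after `recover`, parsing can resume at the returned position with the ")" that state 4 accepts.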

The search for (s, A, t) may yield multiple results. Error recovery may be performed with the first result found, which may be referred to as "panic mode"; alternatively, the results may be compared to select an optimal (e.g., highest-priority) result with which error recovery is performed, which may be referred to as "preferred mode".

It should be noted that the embodiments of the present application divide the errors that occur into "structural errors of the code" and "common syntax errors". A "structural error of the code" is the most serious kind of error, for example unpaired brackets. Such an error is not tolerated: when it is encountered, an error is reported directly and execution ends. A common syntax error, by contrast, is tolerated to a certain extent.

According to the above description of error recovery, when an erroneous symbol ("error-token") is encountered, a group (s, A, t) can be found for error recovery, where s is a state, A is a non-terminal, and t is the error-token or some symbol after it. In the embodiment of the present application, all (s, A, t) with which error recovery is possible are found first, and an optimal result (the result with the highest priority) is selected according to the "preferred mode"; if no optimal result can be found, the "panic mode" is entered. The optimal result (i.e., the result with the highest priority) is selected based on the following preference principles:

1. The search for t should, as far as possible, neither leave the current scope nor descend into scopes nested in the current scope, so that scopes are disturbed as little as possible and the code following the erroneous code can be processed correctly.

2. s should be as close as possible to the top of the state stack, and t should be as close as possible to the error-token, so that as little code information as possible is discarded.

3. A should be a non-terminal that represents the main program structure, i.e. one near the root node of the syntax tree, such as an expression (expression) or a statement (expression-statement, declaration-statement). If A is a non-terminal near the leaf nodes, a new error is more likely to be introduced.
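One way to turn the three principles into a selection rule is sketched below. The text does not state how the principles are combined, so everything about the ordering is an assumption made for illustration: principle 1 is treated as a hard filter, principle 2 as the primary sort key and principle 3 as the tie-breaker. The Candidate type and pick_preferred are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    nonterminal: str      # A
    states_popped: int    # how far s lies below the top of the state stack
    tokens_skipped: int   # how far t lies after the error-token
    stays_in_scope: bool  # True if the search for t stayed in the current scope
    weight: float         # syntax-level weight of A (smaller = nearer the root)


def pick_preferred(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick one candidate; the relative ordering of the criteria is an assumption."""
    usable = [c for c in candidates if c.stays_in_scope]   # principle 1 (as a filter)
    if not usable:
        return None                                        # caller falls back to panic mode
    return min(usable,
               key=lambda c: (c.states_popped + c.tokens_skipped,  # principle 2
                              c.weight))                           # principle 3
```

Returning None when nothing qualifies mirrors the fallback to panic mode described above.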

Fig. 7 shows a schematic diagram of the error recovery process. In fig. 7, A represents the target non-terminal, t represents the target lexical symbol, and get-tolerant-end-token determines the search boundary for t. Some symbols in the stack may carry a scope boundary symbol as an attribute, in which case the scope boundary is used as the search boundary for t. In addition, if an unpaired left bracket ("{", "(", "[") is found when searching from the top of the symbol stack toward the bottom, the search boundary for t is the right bracket paired with that left bracket.
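The bracket-based boundary rule can be illustrated with the following sketch. It assumes that brackets appear on the symbol stack as plain strings and that the boundary is reported as an index into the remaining token list; both function names are hypothetical, with tolerant_end_index only loosely echoing get-tolerant-end-token from fig. 7.

```python
from typing import List, Optional

PAIRS = {"{": "}", "(": ")", "[": "]"}
CLOSERS = {right: left for left, right in PAIRS.items()}


def unpaired_left_bracket(symbol_stack: List[str]) -> Optional[str]:
    """Scan from the stack top downward and return the nearest unpaired left bracket."""
    pending: List[str] = []            # right brackets seen so far, awaiting their opener
    for sym in reversed(symbol_stack):
        if sym in CLOSERS:
            pending.append(sym)
        elif sym in PAIRS:
            if pending and pending[-1] == PAIRS[sym]:
                pending.pop()          # this opener is already paired on the stack
            else:
                return sym
    return None


def tolerant_end_index(symbol_stack: List[str], tokens: List[str], pos: int) -> int:
    """Index of the token bounding the search for t, or len(tokens) if unbounded."""
    left = unpaired_left_bracket(symbol_stack)
    if left is None:
        return len(tokens)
    depth = 0
    for i in range(pos, len(tokens)):
        if tokens[i] == left:
            depth += 1                 # a nested opener of the same kind
        elif tokens[i] == PAIRS[left]:
            if depth == 0:
                return i               # the right bracket paired with `left`
            depth -= 1
    return len(tokens)
```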

The SYMBOL-WEIGHT-MAP is a "name-value" pair mapping table that records the names of all selectable non-terminals and their corresponding weights. A weight reflects how high or low the syntax level of a non-terminal is, that is, its distance from the root node. The weights are empirical values: the larger the weight, the closer the non-terminal is to the leaf nodes of the syntax tree. For example, "translation-unit" is the start symbol, i.e., the root of the syntax tree, and its weight is 1; "translation-unit" is composed of "declaration-seq", so the weight of "declaration-seq" is 2. The important non-terminals in the SYMBOL-WEIGHT-MAP are listed in table 7 below. In table 7, the left column is the English name of the non-terminal and the right column is its description followed by its weight in parentheses.

Table 7: Descriptions and weights of important non-terminals

translation-unit	Start symbol, representing the entire translation unit (1)
declaration-seq	Declaration sequence, consisting of declarations (2)
declaration	Declaration (2.1)
function-body	Function body (4)
namespace-body	Namespace declaration body (4)
class-body	Class declaration body (4)
enum-body	Enumeration declaration body (4)
simple-declaration	Simple declaration, i.e. a non-template declaration (4)
member-simple-declaration	Simple class member declaration (5.2)
template-member-declaration	Template class member declaration (5.2)
compound-statement	Compound statement (7.2)
expression-statement	Expression statement (7.3)
declaration-statement	Declaration statement (7.3)
selection-statement	Selection statement, e.g. if-else and switch statements (7.3)
iteration-statement	Iteration (loop) statement, e.g. for, while, do-while (7.3)
jump-statement	Jump statement, e.g. return and goto statements (7.3)
labeled-statement	Labeled statement, e.g. case statements and user-defined label statements (7.3)
condition	Condition (7.6)
template-parameter-list	Template parameter list (7.6)
template-argument-list	Template argument list (7.6)
template-argument	Template argument (8)
exception-declaration	Exception declaration (8.1)
expression	Expression (9)
expression-list	Expression list (9)
assignment-expression	Assignment expression (9.1)
constant-expression	Constant expression (9.2)
conditional-expression	Conditional expression (9.3)
logical-or-expression	Logical OR expression (9.4)
logical-and-expression	Logical AND expression (9.5)
declarator	Declarator (14)
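As an illustration, part of the mapping table can be held in an ordinary dictionary. The weights below are copied from table 7; the Python name SYMBOL_WEIGHT_MAP and the helper highest_level are assumptions introduced only for this sketch.

```python
from typing import Iterable, Optional

# A subset of table 7: non-terminal name -> weight (smaller = nearer the root).
SYMBOL_WEIGHT_MAP = {
    "translation-unit": 1,
    "declaration-seq": 2,
    "declaration": 2.1,
    "function-body": 4,
    "compound-statement": 7.2,
    "expression-statement": 7.3,
    "condition": 7.6,
    "expression": 9,
    "assignment-expression": 9.1,
    "declarator": 14,
}


def highest_level(names: Iterable[str]) -> Optional[str]:
    """Return the listed non-terminal nearest the root, ignoring names not in the map."""
    known = (n for n in names if n in SYMBOL_WEIGHT_MAP)
    return min(known, key=SYMBOL_WEIGHT_MAP.__getitem__, default=None)
```

Non-terminals absent from the map are simply not selectable here, which corresponds to the preferred mode being restricted to the names recorded in the table.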

The general flow of the panic mode is similar to that of the preferred mode, but the panic mode is not restricted by the SYMBOL-WEIGHT-MAP table and does not compare results: error recovery is performed as soon as an (s, A, t) meeting the recovery condition is found. Although the panic mode is certain to achieve error recovery, it may disturb scopes and discard a large amount of code information.

Fig. 8 is a schematic diagram of the process, according to the embodiment of the present application, of sequentially analyzing each lexical symbol in the linear linked list based on the first lookup table and the second lookup table to generate the syntax tree of the code file to be analyzed. The process includes sub-processes such as part-of-speech determination, lookup in the first lookup table and the second lookup table, conflict elimination processing and error recovery processing. In fig. 8, "token" represents the lexical symbol currently read; "token-list.first" represents the first symbol of the linear linked list; "state-stack.top()" represents the state at the top of the state stack (a state found according to the second lookup table); "token.cat" represents the part of speech of the lexical symbol currently read; the value of Action(state, cat) may be represented as a two-tuple (m, x), where m = R represents a reduction action and m = S represents a shift action, state represents a state, and cat represents a part of speech; and "token = token.next()" represents taking the lexical symbol following the current lexical symbol as the new current lexical symbol.
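The table-driven loop outlined for fig. 8 can be illustrated on a toy grammar. The sketch below is not the grammar or the tables of the embodiment: it assumes the grammar S → ( S ) | x, hand-built Action and Goto tables for it, a Python list in place of the linear linked list, an "ACC" entry to mark acceptance, and x in (m, x) standing for a target state on a shift and a production number on a reduction. Conflict elimination and error recovery are omitted.

```python
from typing import List

# Productions of the toy grammar: number -> (left-hand side, right-hand-side length).
PRODUCTIONS = {1: ("S", 3),   # S -> ( S )
               2: ("S", 1)}   # S -> x

# Action(state, cat) = (m, x); "$" marks the end of input. Missing keys are empty entries.
ACTION = {
    (0, "("): ("S", 2), (0, "x"): ("S", 3),
    (1, "$"): ("ACC", 0),
    (2, "("): ("S", 2), (2, "x"): ("S", 3),
    (3, ")"): ("R", 2), (3, "$"): ("R", 2),
    (4, ")"): ("S", 5),
    (5, ")"): ("R", 1), (5, "$"): ("R", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens: List[str]) -> bool:
    """Return True if the token categories form a sentence of the toy grammar."""
    state_stack = [0]
    symbol_stack: List[str] = []
    pos = 0
    while True:
        cat = tokens[pos] if pos < len(tokens) else "$"
        entry = ACTION.get((state_stack[-1], cat))
        if entry is None:
            return False                      # empty entry: an error-token was met
        m, x = entry
        if m == "S":                          # shift: push the symbol and the new state
            symbol_stack.append(cat)
            state_stack.append(x)
            pos += 1                          # token = token.next()
        elif m == "R":                        # reduce by production x
            lhs, length = PRODUCTIONS[x]
            if length:
                del state_stack[-length:]
                del symbol_stack[-length:]
            symbol_stack.append(lhs)
            state_stack.append(GOTO[(state_stack[-1], lhs)])
        else:                                 # "ACC": the input is accepted
            return True
```

In a full implementation the branch that returns False is where error recovery processing would begin, and an entry offering more than one action is where conflict elimination processing would be invoked.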

Further, when the first lookup table (Action table) indicates a shift, the symbol read in and the related state are pushed onto the symbol stack and the state stack respectively, that is, the "shift process" is executed. Fig. 9 takes one lexical symbol (token) as an example and details the processing flow of the shift process. If the lexical symbol (token) currently read is related to scope, the scope operation is performed first; the lexical symbol (token) is then converted into a grammar symbol and pushed onto the symbol stack, and the new state is pushed onto the state stack. This is the "shift process". The symbols related to scope include "{", "}", "if", "for", "while", "switch", "catch", and the like. For example, when the lexical symbol (token) read is "{", the type of the related scope and the operation to perform need to be determined according to the symbol stack and the scope stack:

(1) If the top of the symbol stack is "namespace-head", "{" is the start of a namespace declaration; if the top of the symbol stack is "class-head", "{" is the start of a class member declaration. In either case a related scope needs to be created and the new scope pushed onto the scope stack. When the corresponding "}" is read in, the new scope has ended and is popped.

(2) If the top of the symbol stack is a symbol such as "resolver", "constructor", "destructor" or "contractor", "{" is the start of a function definition, and the operations related to the function body scope need to be performed.

(3) If the top of the symbol stack is none of the expected symbols and the top of the current scope stack is a function body scope, "{" is the start of a compound statement ("compound-statement"), and the compound statement also has its own scope.
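The scope decision made for "{" before it is shifted can be sketched as follows. The scope kinds are plain strings, FUNCTION_HEADS is an illustrative subset of the symbols that introduce a function body, and treating a "{" inside a compound statement as a further compound statement is an added assumption; the class and function names are hypothetical.

```python
from typing import List, Optional

FUNCTION_HEADS = {"constructor", "destructor"}   # illustrative subset (assumption)


def classify_left_brace(symbol_top: str, current_scope: str) -> Optional[str]:
    """Decide which kind of scope, if any, a "{" opens."""
    if symbol_top == "namespace-head":
        return "namespace"
    if symbol_top == "class-head":
        return "class"
    if symbol_top in FUNCTION_HEADS:
        return "function-body"
    if current_scope in ("function-body", "compound-statement"):
        return "compound-statement"
    return None                                  # e.g. a brace that opens no scope


class ScopeTracker:
    def __init__(self) -> None:
        self.scope_stack: List[str] = ["global"]
        self._brace_opened_scope: List[bool] = []   # one flag per unmatched "{"

    def before_shift(self, token: str, symbol_top: str) -> None:
        """Scope operation performed before the token itself is shifted."""
        if token == "{":
            kind = classify_left_brace(symbol_top, self.scope_stack[-1])
            if kind is not None:
                self.scope_stack.append(kind)
            self._brace_opened_scope.append(kind is not None)
        elif token == "}" and self._brace_opened_scope:
            if self._brace_opened_scope.pop():
                self.scope_stack.pop()              # the scope opened by the paired "{" ends
```

Recording one flag per "{" keeps a "}" from popping a scope that its own "{" never opened.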

Since the C++0x standard allows new objects to be defined in the condition or initialization part of statements such as if, while and for, scopes need to be created when such symbols are read in.

For example the following code:

When "if" is read, a new scope (denoted "if-scope") is created. When the production related to "a" in "if (int a = x)" is reduced, the definition information of the variable "a" is written into "if-scope"; when "a" in the "if-else" statement is analyzed later, "part-of-speech determination" finds that it is a local variable. Within the "if" statement, "{ a = 1; }" remains a scope nested in "if-scope". When the reduction by the production selection-statement → if ( condition ) statement is performed, "if-scope" is popped off the stack.

The symbol "(" is sometimes also related to scope, for example in the following code:

When a member function of a class is defined outside the class declaration scope, the scope in which its parameter list lies is still the class declaration scope. Thus, when "type" in "int A: (type p)" is analyzed, part-of-speech determination can directly obtain the type name defined in class A, and the type name does not have to be written in class-qualified form as it otherwise would be under the C++ standard.

Further, when the Action table indicates a reduction, the symbol sequence at the top of the symbol stack can be reduced according to a certain production, and the reduction process is executed: the symbol sequence forming the right-hand side of the production is popped from the symbol stack, the left-hand-side symbol is pushed onto the stack, and the right-hand-side symbol sequence is recorded in the left-hand-side symbol. After all reduction processes have been executed, the symbols together form the syntax tree. The attributes of each symbol also need to be recorded or derived during the reduction process for use by the various mechanisms in the embodiments of the present application.
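How recording the right-hand side in the left-hand-side symbol yields a tree can be seen in the short sketch below. The Node type and the reduce function are hypothetical names, and symbol attributes are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def reduce(symbol_stack: List[Node], lhs: str, rhs_len: int) -> Node:
    """Pop the right-hand-side symbols, record them in a new left-hand-side node, push it."""
    start = len(symbol_stack) - rhs_len
    children = symbol_stack[start:]
    del symbol_stack[start:]
    node = Node(lhs, children)
    symbol_stack.append(node)
    return node
```

Once the last reduction has been carried out, the single node left on the symbol stack is the root of the syntax tree.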

Example two

Fig. 10 is a schematic structural diagram of an apparatus for generating a syntax tree of a code file according to an embodiment of the present application. As shown in fig. 10, the apparatus 100 may include a parsing module 101, a first processing module 102, and a syntax tree generating module 103, where:

the parsing module 101 is configured to, when a code file to be parsed in a predetermined programming language is received, parse each lexical symbol in the code file to be parsed through the lexical parsing module and generate a corresponding linear linked list;

the first processing module 102 is configured to analyze each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, and perform corresponding conflict elimination processing on any lexical symbol when determining that any lexical symbol belongs to a predetermined conflict type;

the syntax tree generating module 103 is configured to generate a syntax tree of the code file to be parsed according to a processing result of the conflict resolution processing.

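Specifically, the predetermined conflict type includes any one of:

a conflict, in the processing of any lexical symbol, between shift processing and reduction processing;

a conflict, in the processing of any lexical symbol, between first reduction processing and second reduction processing.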

Further, a storage module 104 is also included, as shown in fig. 11, wherein:

the storage module 104 is configured to store the current processing state to obtain a first storage result.

Further, the first processing module includes a first processing sub-module 1021 and a second processing sub-module 1022, as shown in fig. 11, wherein:

the first processing submodule 1021 is used for performing first target processing on any lexical symbol according to the context, and sequentially performing corresponding processing on the lexical symbols behind the any lexical symbol based on the first target processing;

the second processing sub-module 1022 is configured to delete the first saved result and continue to perform corresponding processing on subsequent lexical symbols when no processing error occurs until the retry ending symbol is processed.

Further, the first processing module 102 includes a third processing submodule 1023, as shown in fig. 11, wherein:

the third processing sub-module 1023 is configured to, when a processing error occurs during the corresponding processing of the lexical symbols subsequent to any of the lexical symbols in sequence, perform recovery processing according to the first preservation result, perform second target processing on any of the lexical symbols, and perform corresponding processing of the lexical symbols subsequent to any of the lexical symbols in sequence.
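The cooperation of the storage module with the first, second and third processing sub-modules amounts to a save, try and restore cycle, which can be sketched as below. The sketch assumes that the processing state fits in a dictionary that can be deep-copied and that a hypothetical callback, process, handles the symbols up to the retry end symbol and raises ProcessingError on a processing error; none of these names come from the embodiment.

```python
import copy
from typing import Any, Callable, Dict


class ProcessingError(Exception):
    """Raised by the callback when processing a lexical symbol fails."""


def eliminate_conflict(state: Dict[str, Any], first: str, second: str,
                       process: Callable[[Dict[str, Any], str], None]) -> str:
    """Try the first target processing; on error restore the state and use the second."""
    saved = copy.deepcopy(state)       # the first saved result
    try:
        process(state, first)          # runs until the retry end symbol has been processed
        return first                   # no error: the saved result is simply discarded
    except ProcessingError:
        state.clear()
        state.update(saved)            # recovery processing from the first saved result
        process(state, second)         # second target processing
        return second
```

The return value only reports which of the two candidate actions was finally taken.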

Further, a second processing module 105 is also included, as shown in fig. 11, wherein:

the second processing module 105 is configured to perform error recovery processing on the processing result of the conflict elimination processing.

Further, the first processing module 102 includes a part-of-speech determination sub-module 1024 and an analysis sub-module 1025, wherein:

the part-of-speech determination submodule 1024 is configured to sequentially perform part-of-speech determination on each lexical symbol in the linear linked list, and when the part-of-speech of any one of the lexical symbols is determined, search the first lookup table and the second lookup table according to the part-of-speech of any one of the lexical symbols, and obtain a corresponding search result;

the analysis sub-module 1025 is configured to analyze any lexical symbol according to the search result.

Compared with the prior art, the device provided by the embodiment of the present application analyzes each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, and performs corresponding conflict elimination processing on any lexical symbol determined to belong to a predetermined conflict type. This effectively resolves the situation in which the lookup tables offer several possible processing actions and cannot determine which one to select, and lays the necessary foundation for subsequently generating the syntax tree of the code file to be analyzed. The syntax tree of the code file to be analyzed is then generated according to the processing result of the conflict elimination processing. Static analysis can thus be performed on a written code file through the syntax tree, so that syntax errors, writing errors and the like in the written code file can be checked and corrected accurately and efficiently, which greatly saves the time and effort of program developers.

Example three

An embodiment of the present application provides an electronic device. As shown in fig. 12, the electronic device 1200 includes a processor 1201 and a memory 1203, where the processor 1201 is coupled to the memory 1203, for example via a bus 1202. Further, the electronic device 1200 may also include a transceiver 1204. It should be noted that, in practical applications, the number of transceivers 1204 is not limited to one, and the structure of the electronic device 1200 does not constitute a limitation on the embodiment of the present application.

In this embodiment, the processor 1201 is used to implement the functions of the parsing module, the first processing module and the syntax tree generating module shown in fig. 10 or 11, and the functions of the storage module and the second processing module shown in fig. 11.

The processor 1201 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 1201 may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.

Bus 1202 may include a path that conveys information between the aforementioned components. The bus 1202 may be a PCI bus or an EISA bus, etc. The bus 1202 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.

The memory 1203 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

The memory 1203 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 1201. The processor 1201 is configured to execute application program code stored in the memory 1203 to implement the actions of the apparatus for generating a syntax tree of a code file provided by the embodiment shown in fig. 10 or 11.

The electronic device provided by the embodiment of the present application comprises a memory, a processor and a computer program that is stored on the memory and can run on the processor. Compared with the prior art, when the processor executes the program, each lexical symbol in the linear linked list is analyzed in sequence based on the first lookup table and the second lookup table, and corresponding conflict elimination processing is performed on any lexical symbol determined to belong to a predetermined conflict type. This effectively resolves the situation in which the lookup tables offer several possible processing actions and cannot determine which one to select, and lays the necessary foundation for subsequently generating the syntax tree of the code file to be analyzed. The syntax tree of the code file to be analyzed is then generated according to the processing result of the conflict elimination processing. Static analysis can thus be performed on a written code file through the syntax tree, so that syntax errors, writing errors and the like in the written code file can be checked and corrected accurately and efficiently, which greatly saves the time and effort of program developers.

The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method shown in the first embodiment. Compared with the prior art, each lexical symbol in the linear linked list is analyzed in sequence based on the first lookup table and the second lookup table, and corresponding conflict elimination processing is performed on any lexical symbol determined to belong to a predetermined conflict type. This effectively resolves the situation in which the lookup tables offer several possible processing actions and cannot determine which one to select, and lays the necessary foundation for subsequently generating the syntax tree of the code file to be analyzed. The syntax tree of the code file to be analyzed is then generated according to the processing result of the conflict elimination processing. Static analysis can thus be performed on a written code file through the syntax tree, so that syntax errors, writing errors and the like in the written code file can be checked and corrected accurately and efficiently, which greatly saves the time and effort of program developers.

The computer-readable storage medium provided by the embodiment of the present application is applicable to any embodiment of the above method, and details are not repeated here.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.

Claims

1. A method of generating a syntax tree for a code file, comprising:

analyzing each lexical symbol in the linear linked list in sequence based on a first lookup table and a second lookup table, and performing corresponding conflict elimination processing on any lexical symbol when the lexical symbol is determined to belong to a preset conflict type;

2. The method of claim 1, wherein the predetermined conflict type comprises any one of:

a conflict, in the processing of any lexical symbol, between shift processing and reduction processing;

a conflict, in the processing of any lexical symbol, between first reduction processing and second reduction processing.

3. The method of claim 2, further comprising, prior to performing conflict elimination processing on any of the lexical symbols:

and saving the current processing state to obtain a first saving result.

4. The method of claim 3, wherein performing conflict elimination processing on any of the lexical symbols comprises:

performing first target processing on any lexical symbol according to context, and sequentially performing corresponding processing on the lexical symbols behind the any lexical symbol based on the first target processing;

and if no processing error occurs until the processing of the retry end symbol is finished, deleting the first preservation result and continuing to correspondingly process the subsequent lexical symbols.

5. The method of claim 4, further comprising:

and if processing errors occur in the process of sequentially and correspondingly processing the lexical symbols after the any lexical symbol, performing recovery processing according to the first preservation result, performing second target processing on the any lexical symbol, and sequentially and correspondingly processing the lexical symbols after the any lexical symbol.

6. The method according to any one of claims 1 to 5, wherein performing conflict elimination processing on any lexical symbol includes any one of the following cases:

when any lexical symbol belongs to a first preset type, if the lexical symbol conflicts between reduction processing and shift processing, the shift processing is carried out on the lexical symbol;

when any lexical symbol belongs to a second preset type, if any lexical symbol conflicts between shift-in processing and reduction processing, determining to perform shift-in processing or reduction processing on any lexical symbol according to a linear relation between the lexical symbols;

and when any lexical symbol belongs to a third preset type, if the lexical symbol conflicts between the first reduction processing and the second reduction processing, determining to perform the first reduction processing or the second reduction processing on the lexical symbol according to the linear relation between the lexical symbols.

7. The method according to any one of claims 1-6, further comprising, after performing conflict elimination processing on any of the lexical symbols:

and performing error recovery processing on the processing result of the conflict elimination processing.

8. An apparatus for generating a syntax tree for a code file, comprising:

the first processing module is used for analyzing each lexical symbol in the linear linked list in sequence based on the first lookup table and the second lookup table, and performing corresponding conflict elimination processing on any lexical symbol when the lexical symbol is determined to belong to a preset conflict type;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of generating a syntax tree of a code file according to any one of claims 1 to 7 when executing the program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which program, when being executed by a processor, carries out the method of generating a syntax tree of a code file according to any one of claims 1 to 7.