CN114546404A

CN114546404A - Code annotation rate statistical method based on lexical analysis technology

Info

Publication number: CN114546404A
Application number: CN202210168613.0A
Authority: CN
Inventors: 何军
Original assignee: Beijing Simple Technology Co ltd
Current assignee: Beijing Simple Technology Co ltd
Priority date: 2022-03-23
Filing date: 2022-03-23
Publication date: 2022-05-27

Abstract

The invention provides a method for calculating the code annotation (comment) rate based on lexical analysis. The source code is loaded and analyzed with a lexical analysis tool to obtain a list of lexical symbols (tokens). Each lexical symbol carries its lexical type, its text content, and the line number and column number of that text content. The lexical types include keywords, block annotations, line annotations, character strings, brackets, and addition and subtraction operators. The lexical symbol list is traversed, and each lexical symbol whose type is block annotation or line annotation is analyzed further. Because the method relies only on the lexical symbol information produced in the lexical analysis stage, it does not pass through the syntax analysis stage; that is, it neither generates nor traverses an abstract syntax tree. Compared with analysis methods based on an abstract syntax tree, it can therefore improve analysis speed and save memory.

Description

Code annotation rate statistical method based on lexical analysis technology

Technical Field

The invention relates to the field of code analysis, in particular to a code annotation rate statistical method based on a lexical analysis technology.

Background

Significance of code annotation rate statistics:

The code annotation rate is an important index of the maintainability of a code project. Annotations help developers review historical code and help other maintainers understand what the code means, which reduces the maintenance cost of the project code. In many excellent open source projects, the core modules often contain more annotation lines than program code.

Existing methods for calculating the code annotation rate include the following:

1. Text matching: matching code lines that begin with //, matching annotated code blocks by their delimiters, and so on.

2. Statistics based on an abstract syntax tree combined with elimination: the abstract syntax tree contains only valid code node information and does not contain annotations, blank lines, or spaces. While the syntax tree is parsed, the number of valid code lines can be counted. The total number of lines and the number of blank lines in the file can be obtained by text analysis, and the number of annotation lines then follows from the formula:

annotation lines = total lines - blank lines - valid code lines
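The elimination formula can be illustrated with a minimal Python sketch. The function name is hypothetical, and the count of valid code lines is passed in as a plain number because in practice it would come from an abstract syntax tree traversal that is not shown here.

```python
def annotation_lines_by_elimination(source: str, valid_code_lines: int) -> int:
    """Annotation lines = total lines - blank lines - valid code lines.

    valid_code_lines is assumed to have been counted elsewhere,
    for example while walking an abstract syntax tree.
    """
    lines = source.splitlines()
    total = len(lines)
    blank = sum(1 for line in lines if not line.strip())
    return total - blank - valid_code_lines


SAMPLE = "// header\n\nint x = 1;\nint y = 2; // trailing\n"
# The sample has 4 lines, 1 blank line, and 2 lines holding valid code.
print(annotation_lines_by_elimination(SAMPLE, valid_code_lines=2))
```

For this sample the result is 1: the trailing annotation on the last line shares its line with valid code, so subtraction alone cannot see it.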

The advantages and disadvantages of these methods are as follows:

Method 1: text matching is simple and fast, but the statistics are not accurate enough. Because no position information about code symbols is available to assist the analysis, annotations in the middle or at the end of a line cannot be handled accurately.

Method 2: accuracy improves over Method 1, but an abstract syntax tree must be processed, so the memory overhead is large and efficiency is relatively low. Annotations in the middle of a line and file header annotations still cannot be distinguished accurately.
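The weakness of Method 1 can be seen in a small Python sketch of a naive text matcher. It is only an illustration: the function name is hypothetical, and C-style `//` and `/* */` delimiters are assumed because the source does not fix a target language.

```python
def count_annotation_lines_by_text(source: str) -> int:
    """Naive text matching: only lines that START with a comment marker count."""
    count = 0
    in_block = False
    for line in source.splitlines():
        stripped = line.strip()
        if in_block:
            count += 1
            if "*/" in stripped:
                in_block = False
        elif stripped.startswith("//"):
            count += 1
        elif stripped.startswith("/*"):
            count += 1
            # Stay in block mode unless the block closes on the same line.
            if "*/" not in stripped[2:]:
                in_block = True
    return count


SAMPLE = (
    "/* header\n"
    " * text */\n"
    "int x = 1; // trailing\n"
    "// full line\n"
)
print(count_annotation_lines_by_text(SAMPLE))
```

The sample contains annotations on four lines, but the matcher reports 3 because the trailing annotation on the third line does not start the line.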

Disclosure of Invention

The invention aims to provide a code annotation rate statistical method based on a lexical analysis technology that solves the problems described in the Background.

In order to achieve the purpose, the invention provides the following technical scheme:

A code annotation rate statistical method based on a lexical analysis technology comprises the following steps:

Step one: load the source code and analyze it with a lexical analysis tool to obtain a lexical symbol list.

Each lexical symbol contains the following information: the lexical type, the lexical text content, and the line number and column number of that text content.

The lexical types include keywords, block annotations, line annotations, character strings, brackets, and addition and subtraction operators.

Step two: traverse the lexical symbol list; if the lexical type of a lexical symbol is block annotation or line annotation, proceed to the next analysis.

Step three: if the lexical symbol is not preceded by a conventional code symbol, it is a file header annotation. This annotation describes the function of the whole file, which distinguishes it from annotations on other code blocks.

If the line number of the lexical symbol equals the line number of the next lexical symbol, the lexical symbol lies in the middle of a code line, and that line is counted as an annotation line.

Step four: if a line contains annotations of several types, the line is counted only once as an annotation line.

Step five: traversing the whole lexical symbol list yields all annotation lines of the code file. The total number of lines can be obtained from the line number of the last lexical symbol, so the line annotation rate of the code file is obtained accurately, and the line annotation rate of the whole project can then be calculated.
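Steps two to five can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Token` class, the type names `BLOCK_COMMENT` and `LINE_COMMENT`, and the function name are hypothetical; an annotation is treated as a file header annotation when no code symbol has appeared before it; and the mid-line check uses the last line of the annotation, which equals the symbol's line number for single-line annotations.

```python
from dataclasses import dataclass

# Hypothetical type names; a real lexical analysis tool defines its own.
ANNOTATION_TYPES = {"BLOCK_COMMENT", "LINE_COMMENT"}


@dataclass(frozen=True)
class Token:
    type: str
    text: str
    line: int    # 1-based line of the first character
    column: int  # 0-based column of the first character


def analyze_annotations(tokens):
    annotation_lines, header_lines, inline_lines = set(), set(), set()
    seen_code = False
    for index, tok in enumerate(tokens):
        if tok.type not in ANNOTATION_TYPES:  # step two: skip code symbols
            seen_code = True
            continue
        last_line = tok.line + tok.text.count("\n")
        span = range(tok.line, last_line + 1)
        # Step four: a set counts each line once, whatever annotations it holds.
        annotation_lines.update(span)
        if not seen_code:  # step three: no code yet, so a file header annotation
            header_lines.update(span)
        nxt = tokens[index + 1] if index + 1 < len(tokens) else None
        if nxt is not None and nxt.type not in ANNOTATION_TYPES and nxt.line == last_line:
            inline_lines.add(last_line)  # annotation in the middle of a code line
    # Step five: total lines from the last symbol (assumes it ends the file).
    total_lines = tokens[-1].line + tokens[-1].text.count("\n") if tokens else 0
    return {
        "annotation_lines": annotation_lines,
        "header_lines": header_lines,
        "inline_lines": inline_lines,
        "total_lines": total_lines,
    }
```

Dividing the number of annotation lines by the total number of lines is one plausible rate definition; the source does not spell out its formula.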

Compared with the prior art, the invention has the beneficial effects that:

With this method, the code annotation rate is analyzed using the lexical symbol information from the lexical analysis stage. No abstract syntax tree needs to be generated or traversed in a syntax analysis stage, so analysis can be faster and use less memory than analysis methods based on an abstract syntax tree. The method identifies ordinary block annotations and line annotations, and also accurately identifies file header annotations and annotations in the middle of a code line, which further improves the accuracy of the code annotation rate statistics.

Drawings

Fig. 1 is a flowchart of a code annotation rate statistical method based on lexical analysis technology.

Fig. 2 is a flowchart of lexical analysis in a code annotation rate statistical method based on the lexical analysis technique.

Detailed Description

The technical solution of the present patent will be described in further detail with reference to the following embodiments.

Referring to Figs. 1-2, a code annotation rate statistical method based on a lexical analysis technology includes the following steps:

Step one: load the source code and analyze it with a lexical analysis tool to obtain a lexical symbol stream, that is, a lexical symbol list.

Each lexical symbol contains the following information: the lexical type, the lexical text content, and the line number and column number of that text content.

The lexical types include keywords, block annotations, line annotations, character strings, brackets, addition and subtraction operators, and the like.

Step two: traverse the lexical symbol list; if the lexical type of a lexical symbol is block annotation (Comment) or line annotation (Line Comment), proceed to the next analysis.

Step three: if the lexical symbol is not preceded by a conventional code symbol, it is a file header annotation (HeadDoc). This annotation describes the function of the whole file, which distinguishes it from annotations on other code blocks.

Finally, the code annotation rate can be calculated from the analyzed data using the annotation rate formula. If file header annotations should not count toward the code annotation rate, they can be excluded from the calculation as required.
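The optional exclusion of file header annotations might look like the following Python sketch. The source does not state its annotation rate formula, so this assumes the rate is the number of annotation lines divided by the total number of lines, and that excluding the header removes its lines from the numerator only. The function and parameter names are hypothetical.

```python
def annotation_rate(annotation_lines, header_lines, total_lines, include_header=True):
    """Line annotation rate of one file (assumed: annotation lines / total lines)."""
    if total_lines <= 0:
        return 0.0
    counted = set(annotation_lines)
    if not include_header:
        # Assumption: header lines leave the numerator; the denominator is unchanged.
        counted -= set(header_lines)
    return len(counted) / total_lines
```

For a 5-line file with annotations on lines 1 to 4, of which lines 1 and 2 form the file header annotation, this gives 0.8 with the header and 0.4 without it.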

In this embodiment, "lexical analysis" is one stage in the process that takes code from source text to a machine-executable program; the specific flow is shown in Fig. 2.

The working principle of the invention is as follows. The statistics are based on lexical analysis: a lexical analysis tool performs lexical analysis on the source code to obtain a series of lexical symbols (tokens).

Each lexical symbol comprises at least the following information:

1. Lexical type (keywords, block annotations, line annotations, character strings, brackets, addition and subtraction operators, etc.)

2. Lexical text content

3. Line number, column number, and similar position information
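A lexical symbol carrying these three pieces of information could be produced by a small regular-expression lexer such as the Python sketch below. It is illustrative only: the C-like token grammar, the type names, and the `tokenize` function are assumptions, and a production lexical analysis tool would cover far more of the language.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    type: str
    text: str
    line: int    # 1-based
    column: int  # 0-based


# Order matters: annotations and strings must be tried before operators.
TOKEN_SPEC = [
    ("BLOCK_COMMENT", r"/\*.*?\*/"),
    ("LINE_COMMENT", r"//[^\n]*"),
    ("STRING", r'"(?:\\.|[^"\\\n])*"'),
    ("KEYWORD", r"\b(?:int|return|if|else|while|for)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("BRACKET", r"[()\[\]{}]"),
    ("OPERATOR", r"[-+*/=<>!]"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("OTHER", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC), re.DOTALL)


def tokenize(source: str) -> Iterator[Token]:
    line, line_start = 1, 0
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind not in ("SKIP", "NEWLINE"):
            yield Token(kind, text, line, match.start() - line_start)
        newlines = text.count("\n")
        if newlines:  # keep line and column tracking correct across multi-line tokens
            line += newlines
            line_start = match.start() + text.rfind("\n") + 1
```

Because strings are matched as whole lexical symbols, a `//` sequence inside a string literal is not mistaken for an annotation, which is the kind of case plain text matching gets wrong.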

Based on the lexical symbol information obtained, a traversal filters out the lexical symbols whose type is block annotation or line annotation. Annotation lines are then identified from the line information of those lexical symbols, and the code annotation rate is calculated.
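The same principle can be tried with an existing lexical analysis tool. The sketch below uses Python's standard `tokenize` module on Python source, where annotations appear as `COMMENT` tokens. It assumes the rate is annotation lines divided by total lines, and the function name is hypothetical. It only illustrates the idea and is not the tool used in the patent.

```python
import io
import tokenize


def python_annotation_rate(source: str) -> float:
    """Share of lines in Python source that carry a comment token."""
    annotation_lines = set()
    total_lines = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # tok.start is a (line, column) pair; lines are 1-based.
            annotation_lines.add(tok.start[0])
        elif tok.type == tokenize.ENDMARKER:
            # The end marker sits on the line after the last source line.
            total_lines = tok.start[0] - 1
    return len(annotation_lines) / total_lines if total_lines else 0.0


SAMPLE = '# header\nx = "# not a comment"  # trailing\n\ny = 2\n'
print(python_annotation_rate(SAMPLE))
```

For the four-line sample the rate is 0.5: lines 1 and 2 carry annotations, and the `#` inside the string literal is ignored because it belongs to a string token.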

Although the preferred embodiments of the present patent have been described in detail, the present patent is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present patent within the knowledge of those skilled in the art.

Claims

1. A code annotation rate statistical method based on a lexical analysis technology is characterized by comprising the following steps: