CN114546404A - Code annotation rate statistical method based on lexical analysis technology - Google Patents

Code annotation rate statistical method based on lexical analysis technology

Info

Publication number
CN114546404A
Authority
CN
China
Prior art keywords
lexical
annotation
code
line
symbol
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210168613.0A
Other languages
Chinese (zh)
Inventor
何军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simple Technology Co ltd
Original Assignee
Beijing Simple Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Simple Technology Co ltd
Priority to CN202210168613.0A
Publication of CN114546404A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/42: Syntactic analysis
    • G06F8/425: Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a code annotation (comment) rate statistical method based on lexical analysis technology. The method loads the source code and analyzes it with a lexical analysis tool to obtain a list of lexical symbols (tokens). Each lexical symbol carries the following information: its lexical type, its text content, and the line number and column number of that text content. The lexical types include keywords, block annotations, line annotations, character strings, brackets, and addition and subtraction operators. The lexical symbol list is then traversed, and every lexical symbol whose type is block annotation or line annotation is passed to the next analysis step. Because the invention analyzes the code annotation rate using only the lexical symbol information from the lexical analysis stage, it does not need the syntax analysis stage; that is, no abstract syntax tree has to be generated or traversed. Compared with analysis methods based on an abstract syntax tree, this can improve analysis speed and save memory.

Description

Code annotation rate statistical method based on lexical analysis technology
Technical Field
The invention relates to the field of code analysis, in particular to a code annotation rate statistical method based on a lexical analysis technology.
Background
Significance of code annotation rate statistics:
The code annotation rate is an important indicator of the maintainability of a code project. Annotations help developers review historical code and help other maintainers understand what the code means, which reduces the maintenance cost of the project code. In many excellent open source projects, the core modules often contain more annotation than program code.
Existing methods for counting the code annotation rate include:
1. Text matching: matching code lines that begin with //, matching annotated code blocks, and so on;
2. Statistics based on an abstract syntax tree combined with elimination: the abstract syntax tree contains only valid code nodes and does not contain annotations, blank lines, or spaces. While the syntax tree is parsed, the number of valid code lines can be counted. The total number of lines and the number of blank lines in the file can be obtained by text analysis, and the number of annotation lines then follows from the formula:
annotation lines = total lines - blank lines - valid code lines;
The advantages and disadvantages of these methods are:
Method 1: text matching is simple and fast, but the statistics are not accurate enough. Because no position information about code symbols is available to assist the analysis, annotations in the middle or at the end of a line cannot be handled accurately;
Method 2: accuracy is improved compared with Method 1, but an abstract syntax tree has to be processed, so the memory overhead is large and the efficiency is relatively low. In addition, annotations in the middle of a line and file header annotations cannot be accurately distinguished in the statistics.
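The weakness of Method 1 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the two matching rules, the function names, and the C-like sample are hypothetical and are not taken from the patent.

```python
def prefix_match_lines(source):
    """Flag a line only if it starts with // or /* once leading spaces are removed."""
    flagged = set()
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip()
        if stripped.startswith("//") or stripped.startswith("/*"):
            flagged.add(number)
    return flagged


def substring_match_lines(source):
    """Flag a line if // or /* appears anywhere in it."""
    flagged = set()
    for number, line in enumerate(source.splitlines(), start=1):
        if "//" in line or "/*" in line:
            flagged.add(number)
    return flagged


# Hypothetical C-like sample: the real annotation lines are 1, 2, 4 and 5.
SAMPLE = '''\
// header
int x = 1; // trailing comment
char *url = "http://example.com";
/* block
   continues here */
'''

print(sorted(prefix_match_lines(SAMPLE)))     # misses the trailing comment and line 5
print(sorted(substring_match_lines(SAMPLE)))  # wrongly flags the string on line 3
```

Neither rule recovers the true set of annotation lines, because plain text matching cannot tell whether a // sits inside a string or where a block annotation ends.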
Disclosure of Invention
The invention aims to provide a code annotation rate statistical method based on lexical analysis technology, so as to solve the problems described in the Background.
To achieve this purpose, the invention provides the following technical solution:
A code annotation rate statistical method based on lexical analysis technology comprises the following steps:
step one, loading the source code and analyzing it with a lexical analysis tool to obtain a list of lexical symbols (tokens);
each lexical symbol contains the following information: the lexical type, the lexical text content, and the line number and column number of that text content;
the lexical types include keywords, block annotations, line annotations, character strings, brackets, and addition and subtraction operators;
step two, traversing the lexical symbol list, and if the lexical type of a lexical symbol is block annotation or line annotation, passing it to the next analysis step;
step three, if no conventional code symbol precedes the lexical symbol, the lexical symbol is a file header annotation; this annotation describes the function of the whole file and is therefore distinguished from the annotations of other code blocks;
if the line number of the lexical symbol equals the line number of the next lexical symbol, the lexical symbol lies in the middle of a code line, and that line is counted as an annotation line;
step four, if a line contains annotations of several types, the line is counted only once as an annotation line;
step five, once the lexical symbol list has been traversed, all annotation lines of the code file are known; the total number of code lines can be obtained from the line number of the last lexical symbol, so the line annotation rate of the code file is obtained accurately, and the line annotation rate of the whole project can then be calculated.
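The five steps can be sketched in Python as follows. This is a minimal sketch under assumptions, not the patented implementation: the regex token grammar for a C-like language, the `Token` record, and the names `lex` and `analyze` are all hypothetical, and a real tool would use a full lexer for the target language.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str    # lexical type: "block_comment", "line_comment", "string" or "code"
    text: str    # lexical text content
    line: int    # 1-based line number on which the token starts
    column: int  # 0-based column at which the token starts


# Hypothetical token grammar for a C-like language (an assumption, not from the patent).
TOKEN_RE = re.compile(r'''
    (?P<block_comment>/\*.*?\*/)
  | (?P<line_comment>//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<newline>\n)
  | (?P<space>[ \t\r]+)
  | (?P<code>[A-Za-z_]\w*|\d+|.)
''', re.VERBOSE | re.DOTALL)


def lex(source):
    """Step one: turn source text into a list of tokens with positions."""
    tokens, line, line_start = [], 1, 0
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind not in ("newline", "space"):
            tokens.append(Token(kind, text, line, match.start() - line_start))
        if "\n" in text:  # keep the line counter in sync, also inside block comments
            line += text.count("\n")
            line_start = match.start() + text.rfind("\n") + 1
    return tokens


def analyze(source):
    """Steps two to five: classify annotation tokens and collect line statistics."""
    tokens = lex(source)
    comment_lines, header_lines, inline_lines = set(), set(), set()
    seen_code = False
    for i, tok in enumerate(tokens):
        if tok.kind not in ("block_comment", "line_comment"):
            seen_code = True                      # a conventional code symbol
            continue
        last = tok.line + tok.text.count("\n")    # line on which the annotation ends
        covered = range(tok.line, last + 1)
        comment_lines.update(covered)             # step four: a set counts each line once
        if not seen_code:                         # step three: no code symbol before it
            header_lines.update(covered)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and nxt.line == last:  # step three: next symbol on the same line
            inline_lines.add(last)
    total = 0
    if tokens:                                    # step five: line of the last symbol
        total = tokens[-1].line + tokens[-1].text.count("\n")
    return {"comment": comment_lines, "header": header_lines,
            "inline": inline_lines, "total": total}


SAMPLE = '''\
/* File header:
   demo module */
int add(int a, int b) {
    // sum the operands
    return a /* left */ + b; // trailing
}
char *s = "// not a comment";
'''

stats = analyze(SAMPLE)
print(stats)
```

For this sample the sketch reports annotation lines 1, 2, 4 and 5 out of 7 lines, marks lines 1 and 2 as the file header annotation, and marks line 5 as containing a mid-line annotation. The string on line 7 is not counted, because the lexer consumes it as a single string token.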
Compared with the prior art, the beneficial effects of the invention are as follows:
With this method, the code annotation rate is analyzed using the lexical symbol information from the lexical analysis stage, and no abstract syntax tree has to be generated and traversed in the syntax analysis stage. Compared with analysis methods based on an abstract syntax tree, this can increase analysis speed and save memory. The method identifies ordinary block annotations and line annotations, and it also accurately identifies file header annotations and annotations in the middle of a code line, which further improves the accuracy of the code annotation rate statistics.
Drawings
Fig. 1 is a flowchart of a code annotation rate statistical method based on lexical analysis technology.
Fig. 2 is a flowchart of lexical analysis in a code annotation rate statistical method based on the lexical analysis technique.
Detailed Description
The technical solution of the present patent will be described in further detail with reference to the following embodiments.
Referring to figs. 1-2, a code annotation rate statistical method based on lexical analysis technology includes the following steps:
step one, loading the source code and analyzing it with a lexical analysis tool to obtain a lexical symbol stream, that is, a list of lexical symbols;
each lexical symbol contains the following information: the lexical type, the lexical text content, and the line number and column number of that text content;
the lexical types include keywords, block annotations, line annotations, character strings, brackets, addition and subtraction operators, and the like;
step two, traversing the lexical symbol list, and if the lexical type of a lexical symbol is block annotation (Comment) or line annotation (Line Comment), passing it to the next analysis step;
step three, if no conventional code symbol precedes the lexical symbol, the lexical symbol is a file header annotation (HeadDoc); this annotation describes the function of the whole file and is therefore distinguished from the annotations of other code blocks;
if the line number of the lexical symbol equals the line number of the next lexical symbol, the lexical symbol lies in the middle of a code line, and that line is counted as an annotation line;
step four, if a line contains annotations of several types, the line is counted only once as an annotation line;
step five, once the lexical symbol list has been traversed, all annotation lines of the code file are known; the total number of code lines can be obtained from the line number of the last lexical symbol, so the line annotation rate of the code file is obtained accurately, and the line annotation rate of the whole project can then be calculated.
Finally, based on the analyzed data, the code annotation rate can be calculated with an annotation rate formula; if the file header annotation should not be included in the code annotation rate, it can be excluded from the calculation as required.
In this embodiment, "lexical analysis" is one stage in the process that takes code from text to a machine-executable program; the specific flow is shown in fig. 2.
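The text does not spell out the annotation rate formula. The sketch below assumes the simplest reading, annotation lines divided by total lines, with the header lines optionally removed from the numerator only. The function names, the project-level aggregation, and the sample figures are hypothetical.

```python
def annotation_rate(comment_lines, total_lines, header_lines=frozenset(), include_header=True):
    """Assumed formula: counted annotation lines divided by total lines."""
    if total_lines <= 0:
        return 0.0
    counted = set(comment_lines)
    if not include_header:
        counted -= set(header_lines)  # drop the file header annotation on request
    return len(counted) / total_lines


def project_rate(files, include_header=True):
    """Assumed aggregation: all counted annotation lines over all lines in the project."""
    counted = total = 0
    for comment_lines, total_lines, header_lines in files:
        kept = set(comment_lines)
        if not include_header:
            kept -= set(header_lines)
        counted += len(kept)
        total += total_lines
    return counted / total if total else 0.0


# Hypothetical per-file results: (annotation lines, total lines, header annotation lines).
files = [
    ({1, 2, 4, 5}, 7, {1, 2}),  # file with a two-line header annotation
    ({3}, 13, set()),           # file without a header annotation
]

print(annotation_rate({1, 2, 4, 5}, 7, {1, 2}))
print(annotation_rate({1, 2, 4, 5}, 7, {1, 2}, include_header=False))
print(project_rate(files))
print(project_rate(files, include_header=False))
```

Whether the header lines should also be removed from the denominator is a design choice the text leaves open; this sketch keeps the denominator unchanged so that rates with and without the header stay comparable.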
The working principle of the invention is as follows: the statistics are based on lexical analysis technology. A lexical analysis tool performs lexical analysis on the source code to obtain a series of lexical symbols (Token).
Each lexical symbol contains at least the following information:
1. Lexical type (keywords, block annotations, line annotations, character strings, brackets, addition and subtraction operators, etc.)
2. Lexical text content
3. Line number, column number, and similar position information
Based on the obtained lexical symbol information, a traversal can filter out the lexical symbols whose type is block annotation or line annotation; the annotation lines can then be identified from the line information of those lexical symbols, and the code annotation rate can be counted.
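The same filtering idea can be tried with an existing lexical analysis tool. The sketch below uses Python's standard `tokenize` module on Python source, which is an assumption made for illustration: the patent does not name a tool or a target language, and Python has only line comments, so block annotations do not appear here. The function name and the sample are hypothetical.

```python
import io
import tokenize


def python_comment_lines(source):
    """Return (annotation lines, total lines) for Python source, using only tokens."""
    comment_lines = set()
    last_line = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_lines.add(tok.start[0])       # line number of the comment token
        # ENDMARKER and DEDENT are synthetic tokens that may sit one line past the text.
        if tok.type not in (tokenize.ENDMARKER, tokenize.DEDENT):
            last_line = max(last_line, tok.end[0])
    return comment_lines, last_line


SAMPLE = '''\
# module header
x = 1  # trailing comment
s = "# not a comment"
'''

lines, total = python_comment_lines(SAMPLE)
print(sorted(lines), total, len(lines) / total)
```

No syntax tree is built at any point: the comment on line 2 is found from its token type and position alone, and the `#` inside the string on line 3 is ignored because the tokenizer reports it as part of a string token.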
Although the preferred embodiments of the present patent have been described in detail, the present patent is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present patent.

Claims (1)

1. A code annotation rate statistical method based on lexical analysis technology, characterized by comprising the following steps:
step one, loading the source code and analyzing it with a lexical analysis tool to obtain a list of lexical symbols;
each lexical symbol contains the following information: the lexical type, the lexical text content, and the line number and column number of that text content;
the lexical types include keywords, block annotations, line annotations, character strings, brackets, and addition and subtraction operators;
step two, traversing the lexical symbol list, and if the lexical type of a lexical symbol is block annotation or line annotation, passing it to the next analysis step;
step three, if no conventional code symbol precedes the lexical symbol, the lexical symbol is a file header annotation; this annotation describes the function of the whole file and is therefore distinguished from the annotations of other code blocks;
if the line number of the lexical symbol equals the line number of the next lexical symbol, the lexical symbol lies in the middle of a code line, and that line is counted as an annotation line;
step four, if a line contains annotations of several types, the line is counted only once as an annotation line;
step five, once the lexical symbol list has been traversed, all annotation lines of the code file are known; the total number of code lines can be obtained from the line number of the last lexical symbol, so the line annotation rate of the code file is obtained accurately, and the line annotation rate of the whole project can then be calculated.
CN202210168613.0A 2022-03-23 2022-03-23 Code annotation rate statistical method based on lexical analysis technology Pending CN114546404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210168613.0A CN114546404A (en) 2022-03-23 2022-03-23 Code annotation rate statistical method based on lexical analysis technology

Publications (1)

Publication Number Publication Date
CN114546404A (en) 2022-05-27

Family

ID=81677453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210168613.0A Pending CN114546404A (en) 2022-03-23 2022-03-23 Code annotation rate statistical method based on lexical analysis technology

Country Status (1)

Country Link
CN (1) CN114546404A (en)

Similar Documents

Publication Publication Date Title
US9710243B2 (en) Parser that uses a reflection technique to build a program semantic tree
CN108345468B (en) Programming language code duplication checking method based on tree and sequence similarity
CN102339252A (en) Static state detecting system based on XML (Extensive Makeup Language) middle model and defect mode matching
CN109308289A (en) A kind of log parsing template and the log analytic method based on the template
Feng et al. A code comparison algorithm based on AST for plagiarism detection
Tsarfaty et al. Cross-framework evaluation for statistical parsing
CN1877531A (en) Embedded compiled system scanner accomplishing method
CN108563561B (en) Program implicit constraint extraction method and system
CN115481396A (en) NC code abnormality detection method, device, equipment and storage medium
CN110221836A (en) A kind of lexical analysis tool
CN101079890B (en) A method and device for generating characteristic code and identifying status machine
US9436664B2 (en) Performing multiple scope based search and replace within a document
Cooke-Fox et al. Computer translation of IUPAC systematic organic chemical nomenclature. 3. Syntax analysis and semantic processing
CN114546404A (en) Code annotation rate statistical method based on lexical analysis technology
CN111913874B (en) Software defect tracing method based on syntactic structure change analysis
CN112181426B (en) Assembly program control flow path detection method and device
EP4242832A1 (en) Method and apparatus for parsing programming language, and non-volatile storage medium
CN107153564B (en) Lexical analysis tool
CN115858219A (en) Token conversion-based multi-sequence log analysis method and system
CN114327614A (en) Method and application for recording and analyzing data flow of reference model
CN102063423B (en) Disambiguation method and device
CN113032366A (en) SQL syntax tree analysis method based on Flex and Bison
CN109597624A (en) A kind of method that SQL is formatted
Fujita et al. Measurement Analysis and Fault Proneness Indication in Product Line Applications (PLA)
CN113220800B (en) ANTLR-based data field blood-edge analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination