CN117113347A - Large-scale code data feature extraction method and system - Google Patents


Info

Publication number
CN117113347A
Authority
CN
China
Prior art keywords: lexical, grammar, analysis, data, variables
Prior art date
Legal status
Pending
Application number
CN202311074578.7A
Other languages
Chinese (zh)
Inventor
赵亚舟
张世通
陈梦晖
冯智
Current Assignee
Beijing Keyware Co ltd
Original Assignee
Beijing Keyware Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Keyware Co ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G06F21/563: Static detection by source code analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577: Assessing vulnerabilities and evaluating computer system security
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/75: Structural analysis for program understanding

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application provides a large-scale code data feature extraction method and system, wherein the method comprises the following steps: preprocessing a collected source code dataset; performing lexical analysis on the preprocessed source code dataset to decompose it into a sequence of lexical units; converting the lexical unit sequence generated by lexical analysis into an abstract syntax tree; capturing semantic information in the abstract syntax tree; analyzing variables and data flows in the preprocessed source code dataset to determine how variables are defined, used and passed between different statements; and performing feature extraction based on the captured semantic information and on the analysis of variables and data flows. By performing lexical analysis, syntax analysis, semantic analysis, data flow analysis and other steps on code in each language, a unified feature symbol for the code is generated. The key information and features of the code are extracted for use in static-scanning analysis and detection, overcoming the shortcomings of generating code features from hash or MD5 values.

Description

Large-scale code data feature extraction method and system
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method and system for extracting features of large-scale code data.
Background
In static code detection, the extraction of code features is crucial: the accuracy of code feature extraction and of the feature retrieval results determines the credibility of static code detection. Code feature extraction refers to extracting key information and features from source code for further analysis and detection. Code feature extraction plays the following roles in static code detection:
a. detecting code quality: code feature extraction may help identify potential problems and flaws in the code, such as code duplication, unused variables, files that are not closed, and so forth. By analyzing the code characteristics, the problems can be found and repaired in advance, and the quality and maintainability of the code are improved.
b. Finding a security hole: code feature extraction may be used to detect security vulnerabilities, such as code injection, buffer overflows, cross-site scripting attacks, etc. By analyzing the code features, potential security vulnerabilities can be discovered and corresponding measures taken to repair or mitigate those vulnerabilities.
c. Evaluation of performance problems: code feature extraction may be used to evaluate code performance issues such as inefficient algorithms, memory leaks, etc. By analyzing the code characteristics, the performance bottleneck can be found and the corresponding optimization can be performed, so that the execution efficiency and response speed of the code are improved.
d. Code style checking: code feature extraction may be used to check whether the style and conventions of the code meet predetermined criteria. By analyzing the code features, inconsistent naming conventions, indentation styles, comment conventions, etc. can be detected, and corresponding suggestions and guidelines provided to improve code readability and maintainability.
e. Detecting code complexity: code feature extraction may help evaluate the complexity of the code, such as the nesting depth of the function, the number of nesting layers of the loop, etc. By analyzing the code features, code segments of excessive complexity can be discovered and optimization suggestions provided to simplify code logic and improve intelligibility.
In general, code feature extraction plays a role in providing critical information and features in code static detection, helping developers to discover and fix potential problems, improving code quality, enhancing security, and optimizing code performance and readability.
There are many languages currently in use in the market, more than 30 kinds, such as C, C++, JavaScript, C#, Java, Objective-C, Go, Scala, Perl, Ruby, TypeScript, Python, PHP, COBOL, PL/SQL, PL/I, ABAP, VB.NET, VB, RPG, Swift, CSS, Erlang, Groovy, Lua, Puppet, XML, Clojure, F# and Haskell. Features are extracted from these different languages in a variety of ways, mostly by removing or replacing the useless parts of the code (such as whitespace, variable names, class names, function names, file names, comments, brackets and the like). Such approaches can analyze a single feature of the code, but the false-negative and false-positive rates of the resulting static analysis are too high.
Disclosure of Invention
In view of the foregoing, it is an object of one or more embodiments of the present disclosure to provide a method and system for extracting features of large-scale code data to improve the effect of static analysis.
In a first aspect, there is provided a large-scale code data feature extraction method comprising the steps of:
collecting a source code data set, and preprocessing the collected source code data set;
performing lexical analysis on the preprocessed source code dataset to decompose the source code dataset into a lexical unit sequence according to lexical rules;
converting the lexical unit sequence generated by lexical analysis into an abstract syntax tree;
capturing semantic information in an abstract syntax tree;
analyzing variables and data flows in the preprocessed source code dataset to determine the definition, use and transfer of variables between different statements;
feature extraction is performed based on semantic information in the captured abstract syntax tree, as well as analysis of variables and data streams.
In this technical scheme, a unified feature symbol for the code is generated by performing lexical analysis, syntax analysis, semantic analysis, data flow analysis and other steps on code in each language. The key information and features of the code can be extracted for use in static-scanning analysis and detection, overcoming the shortcomings of generating code features from hash or MD5 values.
In a specific embodiment, the pre-processed source code dataset is lexically analyzed to decompose the source code dataset into a sequence of lexical units according to lexical rules; the method specifically comprises the following steps:
defining lexical rules according to the specification of the programming language and the identifiers, keywords, operators or constants in its grammar;
constructing a lexical rule table according to the defined lexical rule;
reading the source code character by character; for each character, applying the lexical rule table to match possible lexical units; if a matching lexical rule is found, adding the character to the value of the current lexical unit; if no lexical rule matches, adding the current lexical unit to the lexical unit list and resetting the current lexical unit;
and returning the lexical unit list as a result.
In a specific implementation manner, the lexical unit sequence generated by lexical analysis is converted into an abstract syntax tree, specifically:
defining grammar rules according to the grammar specification of the programming language; the grammar rules include non-terminals and terminals; wherein a non-terminal represents a combination of grammar structures, and a terminal represents a lexical unit;
constructing a grammar rule table based on the defined grammar rules; wherein each grammar rule comprises a left-hand side and a right-hand side; the left-hand side is a non-terminal, and the right-hand side is a sequence consisting of terminals and non-terminals;
constructing a parser using a top-down or bottom-up method;
converting the lexical unit sequence into an abstract syntax tree according to the grammar rules and the parser; matching the lexical unit sequence using recursive descent analysis or LR analysis according to the grammar rule table, and constructing the nodes of the syntax tree;
returning the generated abstract syntax tree as a result.
In one specific implementation, semantic information in an abstract syntax tree is captured; the method comprises the following steps:
the symbol table construction specifically comprises the following steps: creating a symbol table for storing information about variables, functions or classes; traversing the abstract syntax tree and, when encountering a variable declaration or function definition semantic structure, adding its related information to the symbol table;
type checking, which specifically comprises performing type checks on expressions, assignment statements and the like in the code to ensure type consistency and correctness; traversing the syntax tree, performing type inference on each expression, and checking the type match between operators and operands; using the type information in the symbol table to check whether the use of variables complies with the semantic rules;
scope analysis, specifically including analyzing the scopes in the code to determine the visibility of variables and access rules; maintaining a scope stack to track the variables of the current scope during traversal of the syntax tree; processing the syntax structures of variable declarations and of scope entry and exit, adding variables to the current scope or removing variables from the scope stack;
control flow analysis, specifically comprising analyzing control flow structures such as conditional statements and loop statements in the code; constructing a control flow graph by traversing the syntax tree to represent the conditions, branches and loop paths in the code; and analyzing the control flow graph to check whether unreachable code blocks exist and whether loop conditions are correct;
semantic error detection and reporting, specifically comprising detecting semantic errors in the code during semantic analysis and generating corresponding error reports; if errors such as type mismatches, undeclared variables or duplicate declarations are found, they are reported to the developer;
in one specific implementation, semantic information in an abstract syntax tree is captured; further comprises:
the intermediate representation is generated, particularly included in the semantic analysis process, and may be generated for use in subsequent optimization and code generation.
In a specific embodiment, the variables and data flows in the preprocessed source code dataset are analyzed to determine the definition, use and transfer of variables between different statements, specifically:
constructing a control flow graph, specifically comprising constructing a control flow graph of the program to represent the control flow transfer relations among the statements in the program; wherein each basic block in the control flow graph represents a set of consecutive statements;
initializing data flow analysis information, specifically comprising initializing the data flow information at the entry and exit of each basic block, using a symbol table to store the definitions, uses and transfers of variables;
iteratively computing data flow information, specifically comprising: iteratively computing the data flow information in each basic block until convergence, using data flow equations and an iterative algorithm to update it; wherein the data flow equations relate the data flow values at the block entry and exit, and a transfer function updates the data flow values;
marking the definitions and uses of variables, specifically comprising: marking the definition and use points of variables in the control flow graph to determine the paths by which variables are passed between different statements.
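The iterative computation of data flow information described above can be illustrated with a classic reaching-definitions sketch in Python. The control flow graph, GEN/KILL sets and block names below are illustrative assumptions, not the claimed data flow equations:

```python
# Iterative reaching-definitions sketch over a small control flow graph.
# Each block has a GEN set (definitions it creates) and a KILL set
# (definitions it overwrites); IN/OUT sets are recomputed until they
# converge, mirroring the "iterate until convergence" step above.
def reaching_definitions(blocks, preds, gen, kill):
    in_sets = {b: set() for b in blocks}
    out_sets = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for b in blocks:
            new_in = set().union(*(out_sets[p] for p in preds[b])) if preds[b] else set()
            new_out = gen[b] | (new_in - kill[b])
            if new_in != in_sets[b] or new_out != out_sets[b]:
                in_sets[b], out_sets[b] = new_in, new_out
                changed = True
    return in_sets, out_sets

# Hypothetical layout: x is defined in B1 (d1) and redefined in B2 (d2);
# B3 is reached from both B1 and B2, so both definitions reach it.
blocks = ["B1", "B2", "B3"]
preds = {"B1": [], "B2": ["B1"], "B3": ["B1", "B2"]}
gen = {"B1": {"d1"}, "B2": {"d2"}, "B3": set()}
kill = {"B1": {"d2"}, "B2": {"d1"}, "B3": set()}
ins, outs = reaching_definitions(blocks, preds, gen, kill)
print(sorted(ins["B3"]))  # ['d1', 'd2']
```

The fixed-point loop terminates because the IN/OUT sets only grow and are bounded by the finite set of definitions.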
In a specific embodiment, feature extraction is performed based on semantic information in the captured abstract syntax tree, and analysis of variables and data streams; the method specifically comprises the following steps:
data preparation and preprocessing, specifically comprising collecting and cleaning the semantic analyzer data to ensure data quality and consistency; and preprocessing the data, for example removing noise, filling missing values and normalizing;
feature selection and construction, specifically comprising determining features according to task requirements and domain knowledge, and selecting statistical features according to the data types; the statistical characteristics of the data can be computed as original features, and new features can be constructed by combining several original features;
dividing the dataset into a training set, a validation set and a test set; the training set is used for model training, the validation set for model selection and tuning, and the test set for final model evaluation;
statistical feature extraction, specifically comprising performing statistical analysis on each feature and computing the relevant statistical features; the mean, variance, standard deviation, median, maximum, minimum or percentiles are computed as statistical features.
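As a sketch of the statistical feature extraction step, the listed statistics can be computed with Python's standard statistics module (population variance is an assumption here, and percentiles are omitted for brevity):

```python
import statistics

def statistical_features(values):
    """Compute the statistical features named above for one raw
    feature column (illustrative sketch)."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # population variance
        "std": statistics.pstdev(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
    }

feats = statistical_features([2, 4, 4, 4, 5, 5, 7, 9])
print(feats["mean"], feats["std"])  # mean 5, population std 2.0
```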
In a second aspect, there is provided a large-scale code data feature extraction system comprising:
the data acquisition module is used for acquiring a source code data set and preprocessing the acquired source code data set;
the lexical analyzer is used for performing lexical analysis on the preprocessed source code dataset to decompose the source code dataset into a lexical unit sequence according to lexical rules;
the grammar analyzer is used for converting the lexical unit sequence generated by lexical analysis into an abstract grammar tree;
a semantic analyzer for capturing semantic information in the abstract syntax tree;
the data flow analyzer is used for analyzing the variables and the data flow in the preprocessed source code data set to determine the definition, the use and the transmission conditions of the variables among different sentences;
And the feature extractor is used for extracting features according to the semantic information in the captured abstract syntax tree and analysis of variables and data streams.
In this technical scheme, a unified feature symbol for the code is generated by performing lexical analysis, syntax analysis, semantic analysis, data flow analysis and other steps on code in each language. The key information and features of the code can be extracted for use in static-scanning analysis and detection, overcoming the shortcomings of generating code features from hash or MD5 values.
In a specific embodiment, the lexical analyzer is specifically configured to define lexical rules based on the specification of the programming language and the identifiers, keywords, operators or constants in its grammar; construct a lexical rule table according to the defined lexical rules; read the source code character by character; for each character, apply the lexical rule table to match possible lexical units; if a matching lexical rule is found, add the character to the value of the current lexical unit; if no lexical rule matches, add the current lexical unit to the lexical unit list and reset the current lexical unit; and return the lexical unit list as the result.
In a specific embodiment, the parser is specifically configured to define grammar rules according to the grammar specification of the programming language; the grammar rules include non-terminals and terminals, wherein a non-terminal represents a combination of grammar structures and a terminal represents a lexical unit;
construct a grammar rule table based on the defined grammar rules, wherein each grammar rule comprises a left-hand side and a right-hand side; the left-hand side is a non-terminal, and the right-hand side is a sequence consisting of terminals and non-terminals; construct the parser using a top-down or bottom-up method; convert the lexical unit sequence into an abstract syntax tree according to the grammar rules and the parser; match the lexical unit sequence using recursive descent analysis or LR analysis according to the grammar rule table, and construct the nodes of the syntax tree; and return the generated abstract syntax tree as the result.
In a third aspect, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for large-scale code data feature extraction as claimed in any one of the preceding claims when executing the program.
In a fourth aspect, a non-transitory computer readable storage medium is provided, the non-transitory computer readable storage medium storing computer instructions for causing the computer to perform any of the large-scale code data feature extraction methods described above.
In a fifth aspect, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the first aspect and any one of the possible designs of the first aspect.
In addition, for the technical effects of any of the possible designs in the third to fifth aspects, reference may be made to the effects of the corresponding designs in the method section, which are not repeated here.
Drawings
For a clearer description of one or more embodiments of the present description or of the solutions of the prior art, the drawings that are necessary for the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description below are only one or more embodiments of the present description, from which other drawings can be obtained, without inventive effort, for a person skilled in the art.
FIG. 1 is a block diagram corresponding to a large-scale code data feature extraction method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for extracting features of large-scale code data according to an embodiment of the present application;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It is noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present disclosure should be taken in a general sense as understood by one of ordinary skill in the art to which the present disclosure pertains. The use of the terms "first," "second," and the like in one or more embodiments of the present description does not denote any order, quantity, or importance, but rather the terms "first," "second," and the like are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In order to facilitate understanding of the symbol-based large-scale code data feature extraction method provided by the embodiment of the application, an application scenario is first described. The embodiment of the application provides a large-scale code data feature extraction method, which belongs to the technical field of static testing and is used for static testing. In static code detection, the extraction of code features is crucial: the accuracy of code feature extraction and of the feature retrieval results determines the credibility of static code detection. Code feature extraction refers to extracting key information and features from source code for further analysis and detection. Many languages are in use in the market, and most existing approaches extract features by removing or replacing the useless parts of the code (such as blank lines, variable names, class names, function names, file names, comments, brackets and the like). This approach can analyze a single feature of the code and then generate code features using a hash or MD5 algorithm, but it is too coarse: it ignores the contextual relations within the code, the call relations among functions, the association relations among files, and the user's declaration information, so the false-negative and false-positive rates of static analysis are too high. Therefore, the embodiment of the application provides a large-scale code data feature extraction method for improving the accuracy of static analysis. The method is described below with reference to specific steps.
Referring to fig. 1, fig. 1 shows a framework diagram of the large-scale code data feature extraction method according to an embodiment of the present application. The technical roadmap of the symbol-based large-scale code data feature extraction method is mainly divided into a source code dataset, a code analyzer (a lexical analyzer, a syntax analyzer, a semantic analyzer and a data flow analyzer), feature extraction, feature fusion, feature engineering, feature representation, and modeling and model evaluation. The method generates a unified feature symbol for the code by performing lexical analysis, syntax analysis, semantic analysis, data flow analysis and other steps on code in each language. First, the source code is decomposed into basic units through lexical analysis to form the lexical structure of the code; then, the code is parsed into grammatical structures through syntax analysis to form a syntax tree or abstract syntax tree; next, the meaning and relations of the code are interpreted through semantic analysis, and the semantic information of the code is extracted; finally, data flow features such as variable references, assignment relations and dependency relations are extracted through data flow analysis of how data flows and changes in the code. Through these steps, the key information and features of the code can be extracted for use in static-scanning analysis and detection, overcoming the shortcomings of generating code features from hash or MD5 values. The specific steps of the large-scale code data feature extraction method provided by the embodiment of the application are described in detail below.
Referring to fig. 2, fig. 2 shows specific steps of a symbol-based large-scale code data feature extraction method according to an embodiment of the present application. The method specifically comprises the following steps:
step 001: collecting a source code data set, and preprocessing the collected source code data set;
specifically, a source code dataset is collected and the data is preprocessed as needed, such as to remove blank characters, standardized line breaks, and the like.
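As an illustrative sketch only, such preprocessing might look as follows in Python; the exact cleaning rules (dropping blank lines, trimming trailing whitespace) are assumptions beyond the blank-character removal and line-break standardization named above:

```python
def preprocess_source(text: str) -> str:
    """Normalize line breaks and strip trailing whitespace and blank
    lines from raw source text (illustrative preprocessing sketch)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # standardize line breaks
    lines = [line.rstrip() for line in text.split("\n")]   # trim trailing whitespace
    lines = [line for line in lines if line.strip()]       # drop blank lines
    return "\n".join(lines)

print(preprocess_source("int a;\r\n\r\n  int b;  \r"))
# int a;
#   int b;
```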
Step 002: performing lexical analysis on the preprocessed source code dataset to decompose the source code dataset into a lexical unit sequence according to lexical rules;
Specifically, a lexical analyzer tool (e.g., ANTLR, Lex, etc.) is used to perform lexical analysis, decomposing the code into a sequence of lexical units according to the lexical rules of the programming language. The specific steps are as follows:
step a, defining lexical rules according to the specification of the programming language and the identifiers, keywords, operators or constants in its grammar.
Illustratively, the lexical rules are defined according to the specifications and syntax of the programming language. Such as identifiers, keywords, operators, constants, etc.
Step b, constructing a lexical rule table according to the defined lexical rule;
specifically, a lexical rule table is constructed based on the defined lexical rules. Each lexical rule contains a pattern and a corresponding lexical unit type. For example, the pattern may be a regular expression, and the lexical unit type may be an identifier, a keyword, or the like.
Step c, reading a source code character by character;
Specifically, the source code file to be lexically analyzed is first read, and an empty lexical unit list is created to store the results of lexical analysis. Then, for each character, the lexical rule table is applied to match possible lexical units: if a matching lexical rule is found, the character is added to the value of the current lexical unit; if no lexical rule matches, the current lexical unit is added to the lexical unit list and the current lexical unit is reset.
and d, returning the lexical unit list as a result.
Specifically, after the lexical analysis is completed, the list of lexical units is returned as a result.
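The character-by-character scanning of steps a to d can be sketched as a small rule-table lexer in Python. The token types and regular-expression patterns below are illustrative assumptions, not the claimed rule set:

```python
import re

# Hypothetical lexical rule table: (lexical unit type, pattern),
# tried in order, mirroring the pattern/type pairs described above.
LEX_RULES = [
    ("KEYWORD",    r"\b(?:if|else|while|return|int)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>!]=?"),
    ("PUNCT",      r"[(){};,]"),
    ("SKIP",       r"\s+"),           # whitespace is matched but not emitted
]

def tokenize(code: str):
    """Scan the source left to right, applying the rule table at each
    position and emitting (type, value) lexical units."""
    tokens, pos = [], 0
    while pos < len(code):
        for kind, pattern in LEX_RULES:
            m = re.match(pattern, code[pos:])
            if m:
                if kind != "SKIP":
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {code[pos]!r}")
    return tokens

print(tokenize("int x = 42;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='), ('CONSTANT', '42'), ('PUNCT', ';')]
```

Rule order matters here: keywords are tried before identifiers so that `int` is not tokenized as an identifier.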
Step 003: converting the lexical unit sequence generated by lexical analysis into an abstract syntax tree;
specifically, the code syntax analysis is a process of converting a lexical unit sequence generated by the lexical analysis into an abstract syntax tree (Abstract Syntax Tree, AST). The following is a specific implementation step of code syntax analysis:
step 01: defining grammar rules according to grammar specifications of a programming language; the grammar rules include non-terminators and terminators; wherein the non-terminal symbol represents a combination of the grammar structures and the terminal symbol represents a lexical unit;
Step 02: constructing a grammar rule table based on the defined grammar rules; wherein each grammar rule comprises a left non-terminal and a right; wherein the left part is a non-terminal symbol, and the right part is a sequence consisting of a terminal symbol and a non-terminal symbol;
step 03: constructing a parser using a top-down or bottom-up method;
Specifically, a top-down or bottom-up method is used to construct the parser.
A common top-down method is recursive descent analysis, which recursively builds the syntax tree downward according to the grammar rules.
A common bottom-up method is LR analysis, which starts from the input string and gradually combines terminals and non-terminals into larger grammatical structures, ultimately building the syntax tree.
Step 04: converting the lexical unit sequence into an abstract syntax tree according to the syntax rules and a syntax analyzer; matching the lexical unit sequence by using a recursion descent analysis method or an LR analysis method according to the grammar rule table, and constructing nodes of a grammar tree;
Specifically, the sequence of lexical units generated by lexical analysis is provided as input to the parser. The lexical unit sequence is then converted into an abstract syntax tree according to the grammar rules: following the grammar rule table, the sequence is matched using recursive descent analysis or LR analysis, and the nodes of the syntax tree are constructed.
Step 05: returning the generated abstract syntax tree as a result.
In summary, the specific implementation steps of code syntax analysis comprise defining the grammar rules, constructing a grammar rule table, constructing a parser, inputting the lexical units, constructing the syntax tree, and finally returning the abstract syntax tree as the result. In practice the development process can be simplified by means of existing tools and libraries; for example, parser code can be generated automatically from grammar rules using tools such as ANTLR, Yacc/Bison or JavaParser.
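As a hedged illustration of the recursive descent method named above, the following toy parser converts a token sequence into an abstract syntax tree for a minimal expression grammar (the grammar, token format and node shapes are assumptions for demonstration only):

```python
# Minimal recursive-descent sketch for a toy grammar:
#   expr -> term (('+'|'-') term)*
#   term -> NUMBER | '(' expr ')'
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in ("+", "-"):       # left-associative chain
            op = eat()
            node = (op, node, term())     # build an AST node per operator
        return node

    def term():
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return ("num", eat())

    tree = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree

print(parse(["1", "+", "(", "2", "-", "3", ")"]))
# ('+', ('num', '1'), ('-', ('num', '2'), ('num', '3')))
```

Each non-terminal of the grammar becomes one function, which is exactly the correspondence that the recursive descent method relies on.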
Step 004: capturing semantic information in an abstract syntax tree;
Specifically, code semantic analysis further analyzes the code on the basis of syntax analysis to capture the semantic information of the code. The following are the specific implementation steps of code semantic analysis:
step A: constructing a symbol table;
The method specifically comprises the following steps: creating a symbol table for storing information of variables, functions, classes, and the like; traversing the abstract syntax tree and, when a variable declaration or function definition semantic structure is encountered, adding its related information to the symbol table.
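As an illustrative sketch of step A (the dictionary-shaped AST nodes with `kind`, `name`, and `children` keys are hypothetical, not a layout the method prescribes), a symbol table can be filled by a single traversal of the tree:

```python
# Sketch: build a symbol table by traversing an abstract syntax tree,
# recording every variable declaration, function definition, and class.

def build_symbol_table(node, table=None):
    if table is None:
        table = {}
    kind = node.get('kind')
    if kind in ('var_decl', 'func_def', 'class_def'):
        # record the declared name with its kind and (optional) type info
        table[node['name']] = {'kind': kind, 'type': node.get('type')}
    for child in node.get('children', []):
        build_symbol_table(child, table)   # recurse into child nodes
    return table

ast = {'kind': 'module', 'children': [
    {'kind': 'var_decl', 'name': 'x', 'type': 'int', 'children': []},
    {'kind': 'func_def', 'name': 'f', 'children': [
        {'kind': 'var_decl', 'name': 'y', 'type': 'float', 'children': []},
    ]},
]}
symbols = build_symbol_table(ast)
```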
Step B: type checking;
The method specifically comprises: performing type checking on expressions, assignment statements, and the like in the code to ensure type consistency and correctness; traversing the syntax tree, performing type inference on each expression, and checking type matching between operators and operands; and using the type information in the symbol table to check whether the use of variables complies with the semantic rules.
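Step B may be sketched as follows; the tuple-shaped expression nodes and the flat name-to-type symbol table are illustrative assumptions:

```python
# Sketch: type inference over binary expressions, resolving variable
# types through the symbol table and rejecting operand mismatches.

def infer_type(expr, symbols):
    if isinstance(expr, bool):
        return 'bool'
    if isinstance(expr, int):
        return 'int'
    if isinstance(expr, float):
        return 'float'
    if isinstance(expr, str):                 # a variable reference
        if expr not in symbols:
            raise NameError(f"undeclared variable: {expr}")
        return symbols[expr]
    op, left, right = expr                    # binary expression node
    lt = infer_type(left, symbols)
    rt = infer_type(right, symbols)
    if lt != rt:                              # operator/operand type match
        raise TypeError(f"type mismatch in {op!r}: {lt} vs {rt}")
    return lt

symbols = {'a': 'int', 'b': 'int'}
t = infer_type(('+', 'a', ('*', 'b', 2)), symbols)
```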
Step C: scope analysis;
Specifically, the scopes in the code are analyzed to determine the visibility and access rules of variables. While traversing the syntax tree, a scope stack is maintained to track the variables of the current scope. The syntax structures for variable declarations and for scope entries and exits are processed, adding variables to the current scope or removing variables from the scope stack.
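A minimal sketch of the scope stack described above (the tuple node forms `block`, `decl`, and `use` are hypothetical conveniences for the example):

```python
# Sketch: scope analysis with an explicit scope stack; entering a block
# pushes a scope, leaving it pops the scope and all of its variables.

class ScopeAnalyzer:
    def __init__(self):
        self.scopes = [{}]                 # stack of {name: declared}

    def declare(self, name):
        if name in self.scopes[-1]:
            raise NameError(f"repeated declaration: {name}")
        self.scopes[-1][name] = True

    def visible(self, name):
        # a variable is visible if any enclosing scope declares it
        return any(name in s for s in self.scopes)

    def visit(self, node):
        if node[0] == 'block':             # scope entry ... exit
            self.scopes.append({})
            for child in node[1]:
                self.visit(child)
            self.scopes.pop()
        elif node[0] == 'decl':
            self.declare(node[1])
        elif node[0] == 'use':
            if not self.visible(node[1]):
                raise NameError(f"undeclared variable: {node[1]}")

a = ScopeAnalyzer()
a.visit(('block', [('decl', 'x'),
                   ('block', [('decl', 'y'), ('use', 'x')])]))
# after the inner block exits, 'y' is removed along with its scope
```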
Step D: control flow analysis;
The method specifically comprises: analyzing control flow structures in the code, such as conditional statements and loop statements; constructing a control flow graph by traversing the syntax tree to represent the conditions, branches, and loop paths in the code; and analyzing the control flow graph to check whether unreachable code blocks exist and whether the loop conditions are correct;
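The unreachable-block check in step D can be illustrated with a control flow graph stored as an adjacency map (the block names below are hypothetical):

```python
# Sketch: flag unreachable basic blocks by a depth-first walk of the
# control flow graph starting from the entry block.

def unreachable_blocks(cfg, entry):
    seen, stack = set(), [entry]
    while stack:                      # depth-first reachability walk
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(cfg.get(block, []))
    return set(cfg) - seen            # blocks never reached from entry

# an if/else diamond plus a dead block nothing ever jumps to
cfg = {'entry': ['then', 'else'], 'then': ['exit'],
       'else': ['exit'], 'exit': [], 'dead': ['exit']}
```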
step E: semantic error detection and reporting.
Specifically, during the semantic analysis process, semantic errors in the code are detected and corresponding error reports are generated. If errors such as type mismatches, undeclared variables, or repeated declarations are found, they are reported to the developer;
in addition, a step F may be included to generate an intermediate representation.
Specifically, during the semantic analysis process, an intermediate representation (e.g., three-address code or an intermediate language) can be generated for use in subsequent optimization and code generation.
In summary, the specific implementation steps of code semantic analysis include symbol table construction, type checking, scope analysis, control flow analysis, semantic error detection and reporting, and optionally intermediate representation generation. The development process can be simplified by means of existing tools and libraries. For example, symbol information can be managed with a symbol table data structure, type checking performed with a type inference algorithm, and the control flow structure analyzed with a control flow graph algorithm.
Step 005: analyzing the variables and data flows in the preprocessed source code dataset to determine the definition, use, and transfer of variables between different statements;
Specifically, code data flow analysis analyzes the variables and data flows in a program to determine the definition, use, and transfer of variables between different statements. The following are the specific implementation steps of code data flow analysis:
step 1: constructing a control flow graph;
Specifically including constructing a control flow graph (Control Flow Graph, CFG) of the program to represent the control flow transfer relationships between statements in the program; wherein each basic block in the control flow graph represents a group of consecutive statements;
Step 2: initializing the data flow analysis information.
Specifically, the data flow information is initialized at the entry and exit of each basic block, using a symbol table to store the definition, use, and transfer of variables.
Step 3: iteratively calculating the data flow information.
The method specifically comprises: iteratively calculating the data flow information in each basic block until convergence, updating the data flow information using data flow equations and an iterative algorithm; wherein the data flow equations include the data flow values at the entry and exit, and a data flow transfer function (transfer function) for updating the data flow values.
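The iterative calculation in step 3 can be illustrated with the classic reaching-definitions data flow problem; the block and definition names are hypothetical, and the transfer function used is the standard OUT = gen ∪ (IN − kill):

```python
# Sketch: iterative reaching-definitions analysis over the basic blocks
# of a CFG, repeated until the IN/OUT sets converge to a fixed point.

def reaching_definitions(cfg, gen, kill):
    IN = {b: set() for b in cfg}
    OUT = {b: set(gen[b]) for b in cfg}
    changed = True
    while changed:                           # iterate until convergence
        changed = False
        for b in cfg:
            preds = [p for p in cfg if b in cfg[p]]
            new_in = set().union(*(OUT[p] for p in preds)) if preds else set()
            # transfer function: OUT = gen U (IN - kill)
            new_out = gen[b] | (new_in - kill[b])
            if new_in != IN[b] or new_out != OUT[b]:
                IN[b], OUT[b] = new_in, new_out
                changed = True
    return IN, OUT

cfg  = {'B1': ['B2'], 'B2': ['B3'], 'B3': []}
gen  = {'B1': {'d1'}, 'B2': {'d2'}, 'B3': set()}   # d1: x=..., d2: x=...
kill = {'B1': {'d2'}, 'B2': {'d1'}, 'B3': set()}   # each kills the other
IN, OUT = reaching_definitions(cfg, gen, kill)
# only d2 reaches B3, because B2 redefines x and kills d1
```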
Step 4: marking the definition and use of variables.
The method specifically comprises: marking the definition and use points of variables in the control flow graph to determine the paths along which variables are transferred between different statements.
Step 5: applying the data flow analysis results.
Specifically, according to the results of the data flow analysis, further optimizations can be performed, such as eliminating unnecessary code, propagating constants, and simplifying complex expressions.
In summary, the specific implementation steps of code data flow analysis include constructing a control flow graph, initializing the data flow analysis information, iteratively calculating the data flow information, marking the definition and use of variables, and applying the data flow analysis results. The development process can be simplified by means of existing tools and libraries. For example, tools such as LLVM, Soot, or DataFlowAnalyzer can automatically construct control flow graphs and provide data flow analysis frameworks and algorithms.
Step 006: feature extraction is performed based on the semantic information captured from the abstract syntax tree and the analysis of variables and data flows.
The method specifically comprises the following steps:
step 010: data preparation and preprocessing.
The method specifically comprises collecting and cleaning the semantic analyzer data to ensure data quality and consistency; and preprocessing the data by removing noise, filling missing values, and normalizing.
Step 020: feature selection and construction.
Specifically, features are determined according to the task requirements and domain knowledge, and statistical features, such as mean, variance, maximum, minimum, or percentiles, are selected according to the data type.
The statistical features of the data may be calculated as original features, or new features may be constructed by combining multiple original features.
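As an illustrative sketch of statistical feature construction (the sample values and the derived `range` feature are hypothetical), using only the standard library:

```python
# Sketch: per-feature statistics as original features, plus a new
# feature ('range') combined from two original ones (max - min).

import statistics

def extract_stats(values):
    return {
        'mean': statistics.mean(values),
        'variance': statistics.pvariance(values),   # population variance
        'min': min(values),
        'max': max(values),
        'median': statistics.median(values),
    }

def combine(feats):
    # a constructed feature built from a combination of original ones
    return {**feats, 'range': feats['max'] - feats['min']}

stats = combine(extract_stats([2, 4, 4, 4, 5, 5, 7, 9]))
```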
Step 030: dividing a data set;
dividing a characteristic data set into a training set, a verification set and a test set; the training set is used for model training, the verification set is used for model selection and tuning, and the test set is used for final model evaluation;
step 040: statistical feature extraction.
The method specifically comprises performing statistical analysis on each feature and calculating the relevant statistical features; the mean, variance, standard deviation, median, maximum, minimum, or percentiles are calculated as statistical features.
In addition, dynamic statistical features, such as a sliding average or sliding variance, can be calculated based on sliding windows or time windows.
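The window-based dynamic features mentioned above may be sketched as follows (the window size of 3 is an arbitrary illustrative choice):

```python
# Sketch: sliding average and sliding variance over a numeric feature
# sequence, one (mean, variance) pair per window position.

import statistics

def sliding_stats(values, window):
    out = []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        out.append((statistics.mean(chunk), statistics.pvariance(chunk)))
    return out

feats = sliding_stats([1, 2, 3, 4, 5], window=3)
# windows [1,2,3], [2,3,4], [3,4,5] -> sliding means 2, 3, 4
```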
As can be seen from the above description, the method provided by the embodiments of the present application performs lexical analysis, syntax analysis, semantic analysis, data flow analysis, and the like on code in each language to generate unified feature symbols for the code. The key information and characteristics of the code can thus be extracted and used for static scanning, analysis, and detection of the code, overcoming the shortcomings of generating code features from hash values, MD5 values, or similar schemes.
In addition, the embodiments of the present application also provide a large-scale code data feature extraction system, which is mainly divided into a source code database set, a code analyzer (lexical analyzer, parser, semantic analyzer, and data flow analysis), feature extraction, feature fusion, feature engineering, feature representation, and modeling and model evaluation. It can specifically be divided into: a data acquisition module, a lexical analyzer, a parser, a semantic analyzer, a data flow analyzer, and a feature extractor.
The data acquisition module is used for collecting a source code dataset and preprocessing the collected source code dataset; the lexical analyzer is used for performing lexical analysis on the preprocessed source code dataset to decompose it into a lexical unit sequence according to the lexical rules; the parser is used for converting the lexical unit sequence generated by lexical analysis into an abstract syntax tree; the semantic analyzer is used for capturing the semantic information in the abstract syntax tree; the data flow analyzer is used for analyzing the variables and data flows in the preprocessed source code dataset to determine the definition, use, and transfer of variables between different statements; and the feature extractor is used for performing feature extraction based on the semantic information captured from the abstract syntax tree and the analysis of variables and data flows.
As an alternative example, the lexical analyzer is specifically configured to define lexical rules based on the identifiers, keywords, operators, or constants in the specification and grammar of the programming language; construct a lexical rule table according to the defined lexical rules; read the source code character by character; for each character, apply the lexical rule table to match possible lexical units; if a matching lexical rule is found, add the character to the value of the current lexical unit; if no lexical rule can be matched, add the current lexical unit to the lexical unit list and reset the current lexical unit; and return the lexical unit list as the result.
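For illustration, the table-driven matching loop described for the lexical analyzer may be sketched as follows; this sketch matches ordered regular-expression rules against the remaining input rather than strictly scanning one character at a time, and the rule table and token kinds are hypothetical:

```python
# Sketch: table-driven lexer; the lexical rule table is an ordered list
# of (kind, pattern) pairs tried in turn at each input position.

import re

RULES = [('NUMBER', r'\d+'),            # integer constants
         ('IDENT', r'[A-Za-z_]\w*'),    # identifiers / keywords
         ('OP', r'[+\-*/=]'),           # operators
         ('WS', r'\s+')]                # whitespace (discarded)

def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        for kind, pattern in RULES:     # try each lexical rule in order
            m = re.match(pattern, source[pos:])
            if m:
                if kind != 'WS':        # skip whitespace lexical units
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"no lexical rule matches at position {pos}")
    return tokens                       # the lexical unit list

toks = tokenize('x1 = 42 + y')
```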
As an alternative example, the parser is specifically configured to define grammar rules according to the grammar specification of a programming language, the grammar rules including non-terminal symbols and terminal symbols, wherein a non-terminal symbol represents a combination of grammar structures and a terminal symbol represents a lexical unit;
construct a grammar rule table based on the defined grammar rules, wherein each grammar rule comprises a left part and a right part, the left part being a non-terminal symbol and the right part being a sequence consisting of terminal symbols and non-terminal symbols; construct the parser using a top-down or bottom-up method; convert the lexical unit sequence into an abstract syntax tree according to the grammar rules and the parser, matching the lexical unit sequence by using the recursive descent analysis method or the LR analysis method according to the grammar rule table and constructing the nodes of the syntax tree; and return the generated abstract syntax tree as the result.
The embodiments of the present application also provide an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the large-scale code data feature extraction method according to any one of the above.
The embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute any one of the above large-scale code data feature extraction methods.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the above possible designs of the present application.
It should be noted that the methods of one or more embodiments of the present description may be performed by a single device, such as a computer or server. The methods of the embodiments may also be applied in a distributed scenario, performed cooperatively by a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more of the steps of the methods of one or more embodiments of the present description, and the devices interact with each other to complete the method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in one or more pieces of software and/or hardware when implementing one or more embodiments of the present description.
The device of the foregoing embodiment is configured to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Fig. 3 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the embodiments of the present disclosure are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples; combinations of features of the above embodiments or in different embodiments are also possible within the spirit of the present disclosure, steps may be implemented in any order, and there are many other variations of the different aspects of one or more embodiments described above which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure one or more embodiments of the present description. Furthermore, the apparatus may be shown in block diagram form in order to avoid obscuring the one or more embodiments of the present description, and also in view of the fact that specifics with respect to implementation of such block diagram apparatus are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present disclosure is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the one or more embodiments of the disclosure, are therefore intended to be included within the scope of the disclosure.

Claims (10)

1. The large-scale code data characteristic extraction method is characterized by comprising the following steps of:
collecting a source code data set, and preprocessing the collected source code data set;
performing lexical analysis on the preprocessed source code dataset to decompose the source code dataset into a lexical unit sequence according to lexical rules;
converting the lexical unit sequence generated by lexical analysis into an abstract syntax tree;
capturing semantic information in an abstract syntax tree;
analyzing the variables and data flows in the preprocessed source code dataset to determine the definition, use, and transfer of variables between different statements;
performing feature extraction based on the semantic information captured from the abstract syntax tree and the analysis of variables and data flows.
2. The large-scale code data feature extraction method of claim 1, wherein performing lexical analysis on the preprocessed source code dataset to decompose the source code dataset into a lexical unit sequence according to lexical rules specifically comprises the following steps:
defining lexical rules according to the identifiers, keywords, operators, or constants in the specification and grammar of the programming language;
constructing a lexical rule table according to the defined lexical rule;
reading a source code character by character; for each character, applying a lexical rule table to match possible lexical units; if the matched lexical rule is found, adding the character into the value of the current lexical unit; if any lexical rule cannot be matched, adding the current lexical unit into a lexical unit list, and resetting the current lexical unit;
and returning the lexical unit list as a result.
3. The method for extracting large-scale code data features according to claim 1, wherein the step of converting the lexical unit sequence generated by the lexical analysis into an abstract syntax tree comprises the following steps:
defining grammar rules according to the grammar specification of a programming language; the grammar rules include non-terminal symbols and terminal symbols; wherein a non-terminal symbol represents a combination of grammar structures and a terminal symbol represents a lexical unit;
constructing a grammar rule table based on the defined grammar rules; wherein each grammar rule comprises a left part and a right part; the left part is a non-terminal symbol, and the right part is a sequence consisting of terminal symbols and non-terminal symbols;
constructing a parser using a top-down or bottom-up method;
converting the lexical unit sequence into an abstract syntax tree according to the grammar rules and the parser; matching the lexical unit sequence by using the recursive descent analysis method or the LR analysis method according to the grammar rule table, and constructing the nodes of the syntax tree;
returning the generated abstract syntax tree as a result.
4. A method of extracting large-scale code data features as claimed in claim 3, wherein capturing the semantic information in the abstract syntax tree specifically comprises the following steps:
symbol table construction, specifically comprising the following steps: creating a symbol table for storing information of variables, functions, or classes; traversing the abstract syntax tree and adding its related information to the symbol table when a variable declaration or function definition semantic structure is encountered;
type checking, specifically comprising: performing type checking on the expressions and assignment statements in the code to ensure type consistency and correctness; traversing the syntax tree, performing type inference on each expression, and checking type matching between operators and operands; and using the type information in the symbol table to check whether the use of variables complies with the semantic rules;
scope analysis, specifically comprising: analyzing the scopes in the code to determine the visibility and access rules of variables; maintaining a scope stack to track the variables of the current scope while traversing the syntax tree; and processing the syntax structures of variable declarations and scope entries and exits, adding variables to the current scope or removing variables from the scope stack;
control flow analysis, specifically comprising: analyzing the control flow structures of conditional statements and loop statements in the code; constructing a control flow graph by traversing the syntax tree to represent the conditions, branches, and loop paths in the code; and analyzing the control flow graph to check whether unreachable code blocks exist and whether the loop conditions are correct;
semantic error detection and reporting, specifically comprising: detecting semantic errors in the code during the semantic analysis process and generating corresponding error reports; and, if type mismatch, undeclared variable, or repeated declaration errors are found, reporting them to the developer.
5. The method of claim 4, wherein capturing the semantic information in the abstract syntax tree further comprises:
generating an intermediate representation, specifically comprising: during the semantic analysis process, generating an intermediate representation for use in subsequent optimization and code generation.
6. The method for extracting features of large-scale code data according to claim 4, wherein analyzing the variables and data flows in the preprocessed source code dataset to determine the definition, use, and transfer of variables between different statements specifically comprises:
constructing a control flow graph, specifically comprising constructing a control flow graph of the program to represent the control flow transfer relationships between statements in the program; wherein each basic block in the control flow graph represents a group of consecutive statements;
initializing data flow analysis information, specifically comprising initializing the data flow information at the entry and exit of each basic block, using a symbol table to store the definition, use, and transfer of variables;
iteratively calculating data flow information, specifically comprising: iteratively calculating the data flow information in each basic block until convergence, updating the data flow information using data flow equations and an iterative algorithm; wherein the data flow equations include the data flow values at the entry and exit, and the data flow transfer function is used to update the data flow values;
marking the definition and use of variables, specifically comprising: marking the definition and use points of variables in the control flow graph to determine the paths along which variables are transferred between different statements.
7. The method for large-scale code data feature extraction of claim 5, wherein performing feature extraction based on the semantic information captured from the abstract syntax tree and the analysis of variables and data flows specifically comprises the following steps:
data preparation and preprocessing, specifically comprising collecting and cleaning the semantic analyzer data to ensure data quality and consistency; and preprocessing the data by removing noise, filling missing values, and normalizing;
feature selection and construction, specifically comprising determining features according to the task requirements and domain knowledge, and selecting statistical features according to the data type; the statistical features of the data can be calculated as original features, or new features can be constructed by combining multiple original features;
dividing the data set into a training set, a verification set and a test set; the training set is used for model training, the verification set is used for model selection and tuning, and the test set is used for final model evaluation;
statistical feature extraction, specifically comprising performing statistical analysis on each feature and calculating the relevant statistical features; the mean, variance, standard deviation, median, maximum, minimum, or percentiles are calculated as statistical features.
8. A large-scale code data feature extraction system, comprising:
the data acquisition module is used for acquiring a source code data set and preprocessing the acquired source code data set;
the lexical analyzer is used for performing lexical analysis on the preprocessed source code dataset to decompose the source code dataset into a lexical unit sequence according to lexical rules;
the parser is used for converting the lexical unit sequence generated by lexical analysis into an abstract syntax tree;
a semantic analyzer for capturing semantic information in the abstract syntax tree;
the data flow analyzer is used for analyzing the variables and data flows in the preprocessed source code dataset to determine the definition, use, and transfer of variables between different statements;
and the feature extractor is used for performing feature extraction based on the semantic information captured from the abstract syntax tree and the analysis of variables and data flows.
9. The large-scale code data feature extraction system of claim 8, wherein the lexical analyzer is specifically configured to define lexical rules based on the identifiers, keywords, operators, or constants in the specification and grammar of the programming language; construct a lexical rule table according to the defined lexical rules; read the source code character by character; for each character, apply the lexical rule table to match possible lexical units; if a matching lexical rule is found, add the character to the value of the current lexical unit; if no lexical rule can be matched, add the current lexical unit to the lexical unit list and reset the current lexical unit; and return the lexical unit list as the result.
10. The large-scale code data feature extraction system of claim 8, wherein the parser is configured to define grammar rules according to the grammar specification of a programming language; the grammar rules include non-terminal symbols and terminal symbols; wherein a non-terminal symbol represents a combination of grammar structures and a terminal symbol represents a lexical unit;
constructing a grammar rule table based on the defined grammar rules; wherein each grammar rule comprises a left part and a right part; the left part is a non-terminal symbol, and the right part is a sequence consisting of terminal symbols and non-terminal symbols; constructing the parser using a top-down or bottom-up method; converting the lexical unit sequence into an abstract syntax tree according to the grammar rules and the parser; matching the lexical unit sequence by using the recursive descent analysis method or the LR analysis method according to the grammar rule table, and constructing the nodes of the syntax tree; and returning the generated abstract syntax tree as the result.
CN202311074578.7A 2023-08-24 2023-08-24 Large-scale code data feature extraction method and system Pending CN117113347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311074578.7A CN117113347A (en) 2023-08-24 2023-08-24 Large-scale code data feature extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311074578.7A CN117113347A (en) 2023-08-24 2023-08-24 Large-scale code data feature extraction method and system

Publications (1)

Publication Number Publication Date
CN117113347A true CN117113347A (en) 2023-11-24

Family

ID=88808709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311074578.7A Pending CN117113347A (en) 2023-08-24 2023-08-24 Large-scale code data feature extraction method and system

Country Status (1)

Country Link
CN (1) CN117113347A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117724804A (en) * 2023-12-19 2024-03-19 万物镜像(北京)计算机系统有限公司 Virtual machine instruction set generation method, device and equipment
Similar Documents

Publication Publication Date Title
US10705943B2 (en) Automating identification of test cases for library suggestion models
CN109343857B (en) Method, apparatus and storage medium for deploying machine learning model on line
EP3695310A1 (en) Blackbox matching engine
EP3679482A1 (en) Automating identification of code snippets for library suggestion models
US11775414B2 (en) Automated bug fixing using deep learning
WO2019051426A1 (en) Pruning engine
Bernardi et al. Design pattern detection using a DSL‐driven graph matching approach
CN110502227B (en) Code complement method and device, storage medium and electronic equipment
US8806452B2 (en) Transformation of computer programs and eliminating errors
CN106295346B (en) Application vulnerability detection method and device and computing equipment
CN111382070B (en) Compatibility testing method and device, storage medium and computer equipment
CN111045678A (en) Method, device and equipment for executing dynamic code on page and storage medium
CN110059006B (en) Code auditing method and device
CN110688121A (en) Code completion method, device, computer device and storage medium
CN117113347A (en) Large-scale code data feature extraction method and system
CN114328208A (en) Code detection method and device, electronic equipment and storage medium
CN108563561B (en) Program implicit constraint extraction method and system
CN109359055B (en) Data testing method and device
CN114153447B (en) Automatic AI training code generation method
CN112069052A (en) Abnormal object detection method, device, equipment and storage medium
CN115951890A (en) Method, system and device for code conversion between different front-end frames
CN115292178A (en) Test data searching method, device, storage medium and terminal
CN115454702A (en) Log fault analysis method and device, storage medium and electronic equipment
CN114691196A (en) Code defect detection method and device for dynamic language and electronic equipment
CN113946339A (en) Application engineering file processing method and device, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication