CN110704308B

CN110704308B - Multistage feature extraction method

Info

Publication number: CN110704308B
Application number: CN201910857082.4A
Authority: CN
Inventors: 程华; 王明扬; 吕正辉
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2022-09-09
Anticipated expiration: 2039-09-11
Also published as: CN110704308A

Abstract

The invention belongs to the technical field of software code similarity detection, and in particular relates to a multistage feature extraction method applied to code similarity detection. The method comprises: acquiring and storing a mixed feature set for each software project in a code base, wherein the mixed feature set comprises folder-level features representing the folder structures in the software project, file-level features representing file semantics, function-level features representing function semantics and syntax, and code-segment-level features representing the syntax, semantics and text of code segments. Because the comprehensive mixed feature set of each software project is obtained and stored in the code base in advance, the software project can be described comprehensively at multiple levels (folders, files, functions and code segments), which improves the detection precision of the system. After the software under test is input, only the features of the software under test need to be computed in real time before comparison, which increases the comparison speed.

Description

Multistage feature extraction method

Technical Field

The invention belongs to the technical field of software code similarity detection, and particularly relates to a multistage feature extraction method applied to code similarity detection.

Background

The detection of repeated codes (also called clone codes) is an important task in the development and maintenance activities of computer software, and is widely applied in a plurality of fields of source code plagiarism detection, software component library inquiry, software defect detection, program understanding and the like.

The invention patent application with publication No. CN101697121A, published on April 21, 2010, discloses a code similarity detection method based on analysis of program source code: parsing the two pieces of source code to be detected into the control dependency trees of two system dependence graphs, and performing basic code normalization on each; using a quantitative-value method to extract candidate similar-code control dependency trees from the two normalized control dependency trees; performing high-level code normalization on the extracted candidate similar code; and calculating semantic similarity to obtain a similarity result, completing the code similarity detection. That method addresses the problems in the prior art that detection accuracy is low for code that is semantically similar but syntactically different, that computational complexity is high, and that similarity detection cannot be performed on large-scale program code.

Conventional code similarity analysis systems, such as the one in the above patent application, characterize code with a relatively uniform kind of feature chosen for their application scenario and code scale: the analysis granularity is essentially fixed at the code line, one of four perspectives (textual, lexical, syntactic or semantic) is selected, and a single feature in the form of a value, string, tree or graph is constructed. Such single features are widely used for intra-project code similarity analysis, but are not suitable for inter-project analysis: as the code scale increases sharply, a flat single feature cannot describe the code comprehensively.

The patent application with publication No. CN109542766A, published on March 29, 2019, discloses a method for rapid similarity detection and evidence generation for large-scale programs based on code mapping and lexical analysis. It detects plagiarism and generates evidence for large-scale software samples using a two-layer similarity detection approach: first, coarse-grained similarity analysis is performed on the large-scale programs using a code mapping method to quickly find suspected similar programs; then, fine-grained lexical analysis is performed on the suspected similar programs to judge program similarity, so that plagiarized code can be found quickly and accurately in large-scale samples.

In conventional code similarity analysis systems such as the one in the above patent application, all code features are computed in real time within the system, which is costly and time-consuming. In addition, the user cannot adjust the comparison strategy and comparison conditions according to requirements on detection type, detection precision and detection speed. Moreover, similarity analysis of large-scale autonomous mixed-source software uses inter-project comparison; that is, a software project input into the similarity analysis system must be compared against a huge code base in the system. Different software under test differs in code size, detection requirements and so on, so which features need to be computed can only be determined after the software under test is input; computing the features of the code base in real time at that point is time-consuming.

Disclosure of Invention

The invention aims to provide a multi-stage feature extraction method which is applicable to code similarity analysis among projects and can provide comprehensive code description when the code scale is increased sharply.

A multi-stage feature extraction method is characterized in that:

acquiring and storing a mixed feature set of each software project of a code base;

the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing syntax, semantics and texts of code segments in the software project.

According to this technical scheme, the comprehensive mixed feature set of each software project in the code base is obtained and stored in the code base in advance, so that the software project can be described completely at multiple levels such as folders, files, functions and code segments, which improves the detection precision of the system. After the software under test is input, only the features of the software under test need to be computed in real time before comparison, which increases the comparison speed.

Further, after acquiring and storing the mixed feature set, the method further includes: acquiring one or more of the folder-level features, file-level features, function-level features and code-segment-level features of the software project to be detected according to the detection requirement of the software project to be detected; and performing feature matching between the acquired one or more features of the software project to be detected and the corresponding features in the mixed feature sets of the software projects in the code base. A plurality of mixed feature subsets can thus be formed from the mixed feature set to meet different testing requirements, which significantly improves the usability of the system.

Preferably, the folder statistics information of each folder in the software project and the association information between the files/functions/variables contained in each folder are obtained as the folder-level features of the corresponding software project.

Preferably, the folder statistical information includes the number of files, the file type, the file size, and the programming language of the folder; the association information between the files/functions/variables contained in the folder comprises a file association graph and a function cross-file call graph.

Preferably, file statistical information of each file in the software project and associated information between functions in each file are acquired as file-level features of the corresponding software project.

Preferably, the file statistical information comprises an API call type, API call times, a static variable type, static variable definition times and static variable use times; the associated information among the functions in the file comprises a function call relation graph.

Preferably, function statistical information of each function in the software project, structured semantic information of codes in the function and structured grammar information of the codes in the function are acquired as the function-level features of the corresponding software project.

Preferably, the function statistical information includes code structure statistical information and variable statistical information; the structured semantic information of the code in the function comprises a code program dependency graph; the structured syntax information for code within the function includes a code abstraction syntax tree.

Preferably, the original text information, symbol information, and definition and use information of variables in different contexts of each code segment in the software project are obtained.

Preferably, the original text information of the code segment comprises character string information formed by standard preprocessing of the code segment; the symbol information of the code segment includes a symbol (token) sequence derived from the code source file.

The invention has the following beneficial effects:

The multi-level mixed feature set of a software project can represent information about the project's code at multiple scales, namely the folder level, file level, function level and code segment level; it characterizes the code from multiple perspectives, including semantics, syntax, lexis and text; and it uses multiple forms of expression, such as values, strings, trees and graphs. The complete mixed feature set can be flexibly decomposed into a plurality of mixed feature subsets to meet different testing requirements, which significantly improves the usability of the system.

Drawings

Fig. 1 is a schematic diagram of a typical application scenario of the method of the present invention.

Detailed Description

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Unless otherwise defined, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that the conventional terms should be interpreted as having a meaning that is consistent with their meaning in the relevant art and this disclosure. The present disclosure is to be considered as an example of the invention and is not intended to limit the invention to the particular embodiments.

A multi-level feature extraction method includes:

step S1, acquiring and storing a mixed feature set of each software project of the code library;

step S2, acquiring one or more of the folder-level features, the file-level features, the function-level features and the code-segment-level features of the software project to be detected according to the detection requirement of the software project to be detected;

step S3, performing feature matching between the acquired one or more features of the software project to be detected and the corresponding features in the mixed feature sets of the software projects in the code base.

The mixed feature set comprises folder level features for representing the structure of each folder in the software project, file level features for representing the semantics of each file in the software project, function level features for representing the semantics and syntax of each function in the software project, and code segment level features for representing the syntax, the semantics and the text of each code segment in the software project.
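As a minimal sketch of how the stored mixed feature set and its subsets could be organized, the following Python fragment uses one dictionary per level. The names `MixedFeatureSet`, `LEVELS`, `code_base` and `store_project` are hypothetical and are not defined by this disclosure; the content of each per-level dictionary is left open.

```python
from dataclasses import dataclass, field

# Hypothetical level names for the four feature levels described above.
LEVELS = ("folder", "file", "function", "segment")


@dataclass
class MixedFeatureSet:
    """Complete mixed feature set of one software project (step S1)."""
    folder: dict = field(default_factory=dict)    # folder-level features
    file: dict = field(default_factory=dict)      # file-level features
    function: dict = field(default_factory=dict)  # function-level features
    segment: dict = field(default_factory=dict)   # code-segment-level features

    def subset(self, levels):
        """Return a mixed feature subset restricted to the requested levels."""
        unknown = set(levels) - set(LEVELS)
        if unknown:
            raise ValueError(f"unknown feature levels: {sorted(unknown)}")
        return {level: getattr(self, level) for level in levels}


# Assumed in-memory stand-in for the code base: project name -> feature set.
code_base = {}


def store_project(name, features):
    """Store the precomputed feature set of a project in the code base."""
    code_base[name] = features
```

A detection requirement then translates into a list of levels, for example `code_base["libfoo"].subset(["segment"])` for a comparison that only needs code-segment-level features.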

Folder level features

Folder-level features focus on describing folder structure and are used to characterize the structure of each folder in a software project. They comprise two main types of features. The first type is folder statistical information, including the number of files contained in the folder, the file types, the file sizes, the programming languages, the creation time, and the like. The second type is features that characterize the associations between the files/functions/variables contained in the folder, including a file association graph, a function cross-file call graph, and the like.

In steps S1 and S2, the folder statistics information of each folder in the software project and the association information between the files/functions/variables contained in each folder are obtained as the folder-level characteristics of the corresponding software project.
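The folder statistical information can be illustrated with a short Python sketch that walks a project directory. The extension-to-language table and the function names are assumptions made for illustration; the association graphs (file association graph, function cross-file call graph) are not covered here because they require language-specific parsing.

```python
import os
from collections import Counter

# Hypothetical extension-to-language table; the disclosure does not define one.
LANGUAGE_BY_EXTENSION = {".c": "C", ".h": "C", ".py": "Python", ".java": "Java"}


def folder_statistics(folder):
    """Collect statistics for the files located directly inside `folder`."""
    file_types = Counter()
    languages = Counter()
    total_size = 0
    file_count = 0
    for entry in os.scandir(folder):
        if not entry.is_file():
            continue  # sub-folders are handled by their own walk step
        file_count += 1
        total_size += entry.stat().st_size
        extension = os.path.splitext(entry.name)[1].lower()
        file_types[extension] += 1
        language = LANGUAGE_BY_EXTENSION.get(extension)
        if language is not None:
            languages[language] += 1
    return {
        "file_count": file_count,
        "file_types": dict(file_types),
        "total_size_bytes": total_size,
        "languages": dict(languages),
    }


def project_folder_features(project_root):
    """Map each folder (path relative to the root) to its statistics."""
    features = {}
    for dirpath, _dirnames, _filenames in os.walk(project_root):
        relative = os.path.relpath(dirpath, project_root)
        features[relative] = folder_statistics(dirpath)
    return features
```

The project root itself appears under the key `"."`; files without an extension are counted under the empty-string file type.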

File level features

File-level features focus on describing coarse-grained information within a file, such as functions and static variables, and are used to represent the semantics of each file in a software project. They may take a mixed form of values, graphs and the like, and mainly comprise two types of features. The first type is file statistical information, including the API call types, the number of API calls, the static variable types, the number of static variable definitions, the number of static variable uses, and the like. The second type is features that characterize the associations between functions within a file, such as a function call relationship graph.

The file statistics of each file in the software project and the association information between the functions in each file are obtained as the file-level characteristics of the corresponding software project in step S1 and step S2.

Function level features

Function-level features focus on describing special code structures within a function, such as loops and branches, and information such as the relationships between statements; they are used to represent semantic and syntactic information. They may take a mixed form of values, trees, graphs and the like, and comprise three types of features. The first type is function statistical information, which includes code structure statistics (e.g., the form and number of loops), variable statistics (e.g., the number of definitions and the number of uses of each variable), and the like. The second type is the structured semantic information of the code within the function, namely the Program Dependency Graph (PDG), which contains the control flow and data flow information within the function. The third type is the structured syntax information of the code within the function, namely the Abstract Syntax Tree (AST), which contains information such as the operation type, operands and context of each statement in the function.

The function statistical information of each function in the software project, the structured semantic information of codes in the function and the structured grammar information of the codes in the function are obtained in the steps S1 and S2 to be used as the function level characteristics of the corresponding software project.

Code segment level features

Code-segment-level features focus on the text and lexical information within code segments and on the usage of variables, and are used to characterize syntactic, lexical and textual information. They may take a mixed form of values, strings and the like, and comprise three types of features. The first type is the original text information of the code segment: the code undergoes standard preprocessing, including removing whitespace and replacing constants, to form a character string, for which a hash value can further be computed for comparison. The second type is the symbol information of the code segment: the code is converted into a symbol (token) sequence through lexical analysis, and statistics such as the frequency of each token are computed. The third type is the definition and use information of variables in different contexts, such as statistics on their occurrences in arithmetic operations.

The original text information, symbol information, and definition and usage information of variables in different contexts of each code segment in the software project are obtained in steps S1 and S2.
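The first two code-segment-level feature types can be illustrated with a small Python sketch. The regular-expression tokenizer, the `CONST` placeholder and the choice of SHA-256 are assumptions made for illustration; a real system would use the lexer of the analyzed language.

```python
import hashlib
import re
from collections import Counter

# Simplified tokenizer: identifiers, numbers, or any single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def normalize_text(segment):
    """Replace numeric constants with a placeholder, then drop all whitespace."""
    replaced = re.sub(r"\b\d+(?:\.\d+)?\b", "CONST", segment)
    return re.sub(r"\s+", "", replaced)


def text_fingerprint(segment):
    """Hash the normalized text so that segments can be compared by value."""
    return hashlib.sha256(normalize_text(segment).encode("utf-8")).hexdigest()


def token_sequence(segment):
    """Convert a code segment into a token sequence."""
    return TOKEN_PATTERN.findall(segment)


def token_frequencies(segment):
    """Count how often each token occurs in the segment."""
    return Counter(token_sequence(segment))
```

With this normalization, `text_fingerprint("a = 1")` and `text_fingerprint("a   =   2")` are equal, since both segments reduce to the string `a=CONST`. Removing all whitespace also merges adjacent words such as `int x`, which is acceptable for an exact-match fingerprint but not for tokenization, so the token sequence is computed from the unnormalized text.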

In summary, the invention extracts the folder-level, file-level, function-level and code-segment-level features of the software project code to represent the semantic, syntactic, lexical and textual information of the code, where the features may take the form of values, strings, trees or graphs.

Fig. 1 shows a typical application scenario of the method of the present invention: the code base stores the mixed feature complete set of each software project, and the software to be detected can extract different mixed feature subsets for matching according to different testing requirements.

For example, under a first test requirement, the main goal is to check whether the software under test was produced by rewriting code from an open-source software library through renaming variables, adding blank lines and similar changes; the mixed feature subset then mainly comprises syntactic/semantic features at the code segment level. That is, in step S2 of this embodiment, the code-segment-level symbol information of the software project to be detected and the definition and use information of its variables in different contexts are computed in real time. In step S3 of this embodiment, the code-segment-level symbol information and the definition and use information of variables in different contexts are taken from the mixed feature set of each code base software project being compared and combined into mixed feature subset 1 of that project; mixed feature subset 1 is then matched against the features of the software project extracted in step S2 to compute the similarity between the two.
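One way such a renaming-tolerant comparison could work is sketched below: identifiers are replaced by a common placeholder before the token sequences are compared. The keyword list, the `ID` placeholder and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration; the disclosure does not prescribe a particular matching algorithm.

```python
import difflib
import re

# Assumed keyword list for a C-like language; not taken from the disclosure.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def abstract_tokens(segment):
    """Tokenize a segment and replace every non-keyword identifier with 'ID'."""
    tokens = []
    for token in TOKEN_PATTERN.findall(segment):
        if IDENTIFIER.fullmatch(token) and token not in KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(token)
    return tokens


def token_similarity(segment_a, segment_b):
    """Similarity in [0, 1] between the abstracted token sequences."""
    matcher = difflib.SequenceMatcher(
        None, abstract_tokens(segment_a), abstract_tokens(segment_b), autojunk=False
    )
    return matcher.ratio()
```

Because whitespace never produces tokens and all identifiers map to the same placeholder, a copy that only renames variables and inserts blank lines yields an identical abstracted sequence and therefore a similarity of 1.0.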

Under a second test requirement, the goal is to quickly check whether the software under test reuses some files from an open-source software library in their entirety; the mixed feature subset then mainly comprises text/lexical features at the file level. That is, in step S2 of this embodiment, the file-level file statistical information of the software project to be detected is computed in real time. In step S3 of this embodiment, the file-level file statistical information is taken from the mixed feature set of each code base software project being compared and combined into mixed feature subset 2 of that project; mixed feature subset 2 is then matched against the features of the software project extracted in step S2 to compute the similarity between the two.
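A possible form of this file-level matching is sketched below, treating each file's statistics as a sparse count vector and comparing vectors by cosine similarity. The similarity measure, the threshold value and the names `cosine_similarity` and `best_matches` are assumptions; the disclosure leaves the matching algorithm open.

```python
import math


def cosine_similarity(stats_a, stats_b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    keys = set(stats_a) | set(stats_b)
    dot = sum(stats_a.get(k, 0) * stats_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in stats_a.values()))
    norm_b = math.sqrt(sum(v * v for v in stats_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def best_matches(target_stats, code_base_stats, threshold=0.99):
    """Find code base files whose stored statistics match the target file.

    `code_base_stats` maps project name -> file path -> statistics dict and
    stands in for the precomputed mixed feature subset 2 of each project.
    """
    hits = []
    for project, files in code_base_stats.items():
        for path, stats in files.items():
            score = cosine_similarity(target_stats, stats)
            if score >= threshold:
                hits.append((project, path, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Only the statistics of the file under test are computed at query time; the code base side is a lookup over stored values, which is the source of the speed advantage described above.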

Although embodiments of the present invention have been described, various changes or modifications may be made by one of ordinary skill in the art within the scope of the appended claims.

Claims

1. A multi-stage feature extraction method is characterized by comprising the following steps:

the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing syntax, semantics and texts of code segments in the software project;

after acquiring and storing the mixed feature set, the method further comprises:

acquiring one or more of folder level characteristics, file level characteristics, function level characteristics and code segment level characteristics of the software project to be detected according to the detection requirement of the software project to be detected;

performing feature matching on the acquired one or more features of the software item to be detected and corresponding features in the mixed feature set of the software items in the code base;

acquiring folder statistical information of each folder in the software project and associated information among files, functions and variables contained in each folder as folder-level characteristics of the corresponding software project;

the related information among the files, the functions and the variables contained in the folder comprises a file related graph and a function cross-file calling graph; acquiring file statistical information of files in the software project and association information between functions in the files as file-level characteristics of the corresponding software project;

the file statistical information comprises an API calling type, API calling times, a static variable type, static variable definition times and static variable use times;

the associated information among the functions in the file comprises a function call relation graph; acquiring function statistical information of each function in the software project, structured semantic information of codes in the function and structured syntactic information of the codes in the function as function-level characteristics of the corresponding software project; the function statistical information comprises code structure statistical information and variable statistical information;

the structured semantic information of the code in the function comprises a code program dependency graph;

the structured syntax information of the code in the function comprises a code abstract syntax tree;

acquiring original text information, symbol information and definition and use information of variables in different contexts of each code segment in a software project;

the original text information of the code segment comprises character string information formed by standard preprocessing of the code segment;

the code segment symbol information includes a symbol sequence based on the code source file.

2. The multi-stage feature extraction method according to claim 1, characterized in that:

the folder statistical information comprises the number of files, the types of the files, the sizes of the files and the programming language of the folders.