CN110704308B - Multistage feature extraction method - Google Patents


Info

Publication number
CN110704308B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910857082.4A
Other languages
Chinese (zh)
Other versions
CN110704308A (en)
Inventor
程华
王明扬
吕正辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-09-11
Filing date
2019-09-11
Publication date
2022-09-09
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910857082.4A
Publication of CN110704308A
Application granted
Publication of CN110704308B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of software code similarity detection, and particularly relates to a multistage feature extraction method applied to code similarity detection. The method comprises: acquiring and storing a mixed feature set of each software project of a code base; the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing the syntax, semantics and text of code segments in the software project. Because the comprehensive mixed feature set of each software project in the code base is obtained and stored in the code base in advance, the information of a software project at multiple levels (folders, files, functions and code segments) can be described comprehensively, which improves the detection precision of the system. After the software to be tested is input, only the features of the software to be tested need to be calculated in real time before the comparison can be carried out, which improves the comparison speed.

Description

Multistage feature extraction method
Technical Field
The invention belongs to the technical field of software code similarity detection, and particularly relates to a multistage feature extraction method applied to code similarity detection.
Background
The detection of duplicated code (also called code clones) is an important task in the development and maintenance of computer software, and is widely applied in fields such as source code plagiarism detection, software component library queries, software defect detection and program understanding.
Invention patent application publication No. CN101697121A, published on April 21, 2010, discloses a code similarity detection method based on analysis of program source code: the two pieces of source code to be detected are each parsed into the control dependency tree of a system dependency graph, and basic code normalization is performed on each; candidate similar-code control dependency trees are extracted from the two normalized control dependency trees by a metric-value method; advanced code normalization is performed on the extracted candidate similar code; and semantic similarity is calculated to obtain the similarity result, completing the code similarity detection. That method addresses the problems in the prior art that detection accuracy is low for code that is semantically similar but syntactically different, that computational complexity is high, and that similarity detection cannot be performed on large-scale program code.
Conventional code similarity analysis systems, such as that of the above patent application, characterize code with relatively uniform features chosen for their application scenarios and code scale: the granularity of analysis is essentially fixed at the code-line level, one of four perspectives (text, lexical, syntactic or semantic) is selected, and a single kind of feature (value, string, tree or graph) is constructed. Such single features are widely used to analyze code similarity within a project, but they are not suitable for analyzing code similarity between projects: as the code scale increases sharply, a flat, single feature cannot describe the code comprehensively.
Application publication No. CN109542766A, published on March 29, 2019, discloses a method for rapidly detecting similarity and generating evidence for large-scale programs based on code mapping and lexical analysis. It detects plagiarism and generates evidence for large-scale software samples using a two-layer similarity detection method: first, coarse-grained similarity analysis is performed on the large-scale programs using code mapping to quickly find suspected similar programs; then, fine-grained lexical analysis is applied to the suspected similar programs to determine program similarity, so that plagiarized code in large-scale samples is found quickly and accurately.
In conventional code similarity analysis systems such as that of the above patent application, all code features are calculated in real time within the system, which is costly and time-consuming. In addition, the user cannot adjust the comparison strategy and comparison conditions according to requirements on detection type, detection precision and detection speed. Moreover, similarity analysis of large-scale autonomous mixed-source software uses inter-project comparison: a software project input into the similarity analysis system must be compared against a huge code base in the system. Because different software to be tested differs in code size, detection requirements and so on, which features need to be calculated can be determined only after the software to be tested is input; calculating the features of the code base in real time at that point is time-consuming.
Disclosure of Invention
The invention aims to provide a multi-stage feature extraction method which is applicable to code similarity analysis among projects and can provide comprehensive code description when the code scale is increased sharply.
A multi-stage feature extraction method is characterized in that:
acquiring and storing a mixed feature set of each software project of a code base;
the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing syntax, semantics and texts of code segments in the software project.
With this technical scheme, the comprehensive mixed feature set of each software project in the code base is obtained and stored in the code base in advance, so that the information of a software project at multiple levels (folders, files, functions and code segments) can be described completely, which improves the detection precision of the system. After the software to be tested is input, only the features of the software to be tested need to be calculated in real time before the comparison can be carried out, which increases the comparison speed.
Further, after acquiring and storing the mixed feature set, the method further includes: acquiring one or more of the folder-level features, file-level features, function-level features and code segment-level features of the software project to be detected according to the detection requirement of the software project to be detected; and performing feature matching between the acquired one or more features of the software project to be detected and the corresponding features in the mixed feature set of the software projects in the code base. A plurality of mixed feature subsets can thus be formed from the mixed feature set to meet different testing requirements, which improves the usability of the system.
Preferably, the folder statistics information of each folder in the software project and the association information between the files/functions/variables contained in each folder are obtained as the folder-level features of the corresponding software project.
Preferably, the folder statistical information includes the number of files, the file type, the file size, and the programming language of the folder; the association information between the files/functions/variables contained in the folder comprises a file association graph and a function cross-file call graph.
Preferably, file statistical information of each file in the software project and associated information between functions in each file are acquired as file-level features of the corresponding software project.
Preferably, the file statistical information comprises an API call type, API call times, a static variable type, static variable definition times and static variable use times; the associated information among the functions in the file comprises a function call relation graph.
Preferably, function statistical information of each function in the software project, structured semantic information of codes in the function and structured grammar information of the codes in the function are acquired as the function-level features of the corresponding software project.
Preferably, the function statistical information includes code structure statistical information and variable statistical information; the structured semantic information of the code in the function comprises a program dependency graph of the code; the structured syntax information of the code in the function includes an abstract syntax tree of the code.
Preferably, the original text information, symbol information, and definition and use information of variables in different contexts of each code segment in the software project are obtained as the code segment-level features of the corresponding software project.
Preferably, the original text information of the code segment comprises character string information formed by standard preprocessing of the code segment; the symbol information of the code segment includes a symbol sequence based on the code source file.
The invention has the following beneficial effects:
the multi-level mixed feature set of a software project can represent information about the project's code at multiple scales: the folder level, the file level, the function level and the code segment level. It characterizes the code from multiple angles, including semantics, syntax, lexis and text, and uses multiple representations, such as values, strings, trees and graphs. The complete mixed feature set can be flexibly decomposed into a plurality of mixed feature subsets to meet different testing requirements, which improves the usability of the system.
Drawings
Fig. 1 is a schematic diagram of a typical application scenario of the method of the present invention.
Detailed Description
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Unless otherwise defined, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that the conventional terms should be interpreted as having a meaning that is consistent with their meaning in the relevant art and this disclosure. The present disclosure is to be considered as an example of the invention and is not intended to limit the invention to the particular embodiments.
A multi-level feature extraction method includes:
step S1, acquiring and storing a mixed feature set of each software project of the code library;
step S2, acquiring one or more of the folder-level features, the file-level features, the function-level features, and the code segment-level features of the software project to be detected according to the detection requirement of the software project to be detected;
step S3, performing feature matching between the acquired one or more features of the software project to be detected and the corresponding features in the mixed feature set of the software projects in the code base.
The mixed feature set comprises folder level features for representing the structure of each folder in the software project, file level features for representing the semantics of each file in the software project, function level features for representing the semantics and syntax of each function in the software project, and code segment level features for representing the syntax, the semantics and the text of each code segment in the software project.
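Steps S1 to S3 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: a project is modeled as a hypothetical mapping from file path to source text, each level's feature is reduced to a toy set of strings, and Jaccard similarity stands in for the matching step, which the text does not specify. The names `EXTRACTORS`, `build_code_base` and `match` are invented for this sketch.

```python
import re
from pathlib import PurePosixPath

# Toy per-level extractors (assumptions for illustration only).
# A project is modeled as {file path: source text}.
EXTRACTORS = {
    "folder": lambda files: frozenset(str(PurePosixPath(p).parent) for p in files),
    "file": lambda files: frozenset(PurePosixPath(p).name for p in files),
    "function": lambda files: frozenset(
        name for text in files.values() for name in re.findall(r"def\s+(\w+)", text)
    ),
    "segment": lambda files: frozenset(
        line.strip() for text in files.values() for line in text.splitlines() if line.strip()
    ),
}


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_code_base(projects):
    """Step S1: compute and store the full mixed feature set of every project."""
    return {
        name: {level: extract(files) for level, extract in EXTRACTORS.items()}
        for name, files in projects.items()
    }


def match(code_base, target_files, levels):
    """Steps S2 and S3: extract only the requested levels from the target,
    then compare them with the stored features of each code-base project."""
    target = {level: EXTRACTORS[level](target_files) for level in levels}
    scores = {}
    for name, stored in code_base.items():
        per_level = [jaccard(target[level], stored[level]) for level in levels]
        scores[name] = sum(per_level) / len(per_level)
    return scores
```

The code-base features are computed once in `build_code_base`; each later call to `match` touches only the target project and only the levels the detection requirement asks for.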
Folder level features
Folder-level features focus on the delineation of folder structures, which are used to characterize each folder structure in a software project. Two main types of features are included. One type is folder statistics, including the number of files contained in the folder, file type, file size, programming language, creation time, and the like. Another class is features that characterize associations between files/functions/variables contained within folders, including file association graphs and function cross-file call graphs, etc.
In steps S1 and S2, the folder statistics information of each folder in the software project and the association information between the files/functions/variables contained in each folder are obtained as the folder-level characteristics of the corresponding software project.
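The folder statistics described above can be sketched as follows. This is a minimal illustration, assuming the project is available as a directory on disk; the function name `folder_statistics` and the extension-to-language table are hypothetical, and the association graphs (file association graph, cross-file call graph) are not covered here.

```python
import os
from collections import Counter

# Assumed mapping from file extension to programming language (illustrative only).
LANGUAGE_BY_EXT = {".c": "C", ".h": "C", ".py": "Python", ".java": "Java"}


def folder_statistics(root):
    """Return per-folder statistics keyed by the folder path relative to root."""
    stats = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        file_types = Counter()
        languages = Counter()
        total_size = 0
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            file_types[ext] += 1
            if ext in LANGUAGE_BY_EXT:
                languages[LANGUAGE_BY_EXT[ext]] += 1
            total_size += os.path.getsize(os.path.join(dirpath, name))
        stats[os.path.relpath(dirpath, root)] = {
            "file_count": len(filenames),
            "file_types": dict(file_types),
            "total_size": total_size,
            "languages": dict(languages),
        }
    return stats
```

Each folder is counted on its own (files in subfolders are not added to the parent), which keeps the per-folder values directly comparable between projects.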
File level features
The file-level features concern the depiction of coarse-grained information such as functions and static variables in files and are used for representing the semantics of all files in a software project. The file-level features may take the form of a mixture of values, graphs, etc., which contain primarily two types of features. One type is file statistical information, and comprises an API calling type, API calling times, a static variable type, static variable definition times, static variable use times and the like; another type is a feature that characterizes associations between functions within a file, such as a function call relationship graph.
The file statistics of each file in the software project and the association information between the functions in each file are obtained as the file-level characteristics of the corresponding software project in step S1 and step S2.
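As an illustration of the file-level features, the sketch below counts API calls and builds a function call relation graph for a single file. It assumes the analyzed file is Python source and uses the standard `ast` module; the patent itself names no language or tool. Treating every call to a name not defined at the top level of the file as an "API call" is a simplification made for this sketch, and static-variable statistics are omitted.

```python
import ast
from collections import Counter


def _callee_name(func):
    """Return a dotted name such as 'os.path.join' for a call target, else None."""
    parts = []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
        return ".".join(reversed(parts))
    return None


def file_level_features(source):
    tree = ast.parse(source)
    functions = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    defined = {f.name for f in functions}
    api_calls = Counter()                        # API call type -> number of calls
    call_graph = {name: set() for name in defined}  # caller -> callees in this file
    for func in functions:
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = _callee_name(node.func)
            if callee is None:
                continue
            if callee in defined:
                call_graph[func.name].add(callee)
            else:
                api_calls[callee] += 1
    return {"api_calls": api_calls, "call_graph": call_graph}
```

Only calls made inside top-level functions are inspected; module-level calls and methods inside classes would need extra handling.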
Function level features
The function-level features concern the depiction of special code structures inside a function, such as loops and branches, and of information such as the relations between statements, and are used to represent semantic and syntactic information. Function-level features may take the form of a mixture of values, trees, graphs, etc., and contain three classes of features. The first class is function statistics, which include code structure statistics (e.g., the form and number of loops), variable statistics (e.g., the number of times variables are defined and used), and the like. The second class is the structured semantic information of the code within a function, i.e., a Program Dependency Graph (PDG), which contains the control flow and data flow information within the function. The third class is the structured syntax information of the code in the function, namely an Abstract Syntax Tree (AST), which contains the operation type, operands, context, and other information of each statement in the function.
The function statistical information of each function in the software project, the structured semantic information of codes in the function and the structured grammar information of the codes in the function are obtained in the steps S1 and S2 to be used as the function level characteristics of the corresponding software project.
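Two of the three function-level feature classes can be sketched as follows, again assuming Python source and the standard `ast` module (an assumption of this sketch, not something the patent prescribes). The function name `function_level_features` is hypothetical. Building a program dependency graph requires control-flow and data-flow analysis and is deliberately left out.

```python
import ast
from collections import Counter


def function_level_features(source):
    """Per function: code structure statistics, variable statistics, and an AST dump."""
    features = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        structure = Counter()    # loop form / branch -> count
        definitions = Counter()  # variable -> number of assignments
        uses = Counter()         # variable -> number of reads
        for node in ast.walk(func):
            if isinstance(node, (ast.For, ast.While)):
                structure[type(node).__name__] += 1
            elif isinstance(node, ast.If):
                structure["If"] += 1
            elif isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    definitions[node.id] += 1
                elif isinstance(node.ctx, ast.Load):
                    uses[node.id] += 1
        features[func.name] = {
            "structure": structure,
            "definitions": definitions,
            "uses": uses,
            "ast": ast.dump(func),  # textual form of the abstract syntax tree
        }
    return features
```

Parameters are `ast.arg` nodes rather than `ast.Name` nodes, so they appear in `uses` but not in `definitions`; nodes of nested functions are also counted in the enclosing function.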
Code segment level features
Code segment-level features focus on the text and lexical information within code segments and on the use of variables, and are used to characterize syntactic, lexical and textual information. They may take the form of a mixture of values, strings, etc., and contain three classes of features. The first class is the original text information of the code segment: the code undergoes standard preprocessing, such as removing whitespace and replacing constants, to form a character string, from which a hash value can further be calculated for comparison. The second class is the symbol information of the code segment: the code is converted into a symbol (token) sequence through lexical analysis, and statistics such as the frequency of each token are calculated. The third class is the definition and use of variables in different contexts, for example statistics on how often a variable occurs in arithmetic operations.
The original text information, symbol information, and definition and usage information of variables in different contexts of each code segment in the software project are obtained in steps S1 and S2.
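The first two code segment-level feature classes can be sketched as follows. The sketch assumes Python code segments and the standard `tokenize` module; the placeholders `NUM` and `STR`, the choice of SHA-256, and the function names are assumptions of this illustration. Comments and whitespace are dropped and constants are replaced before hashing, in the spirit of the preprocessing described above.

```python
import hashlib
import io
import tokenize
from collections import Counter

# Token types that carry no lexical content for comparison purposes.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def segment_tokens(segment):
    """Lexical analysis: token sequence with constants replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(segment).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)
    return out


def segment_features(segment):
    toks = segment_tokens(segment)
    normalized = " ".join(toks)  # preprocessed text: no comments, uniform spacing
    return {
        "text_hash": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "tokens": toks,
        "token_freq": Counter(toks),
    }
```

Two segments that differ only in spacing, comments or constant values produce the same `text_hash`, so equality of hashes gives a cheap first comparison before token frequencies are examined.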
In summary, the invention extracts the folder-level, file-level, function-level and code segment-level features of a software project's code to represent the semantic, syntactic, lexical and textual information of the code; the features may take the form of values, strings, trees or graphs.
Fig. 1 shows a typical application scenario of the method of the present invention: the code base stores the mixed feature complete set of each software project, and the software to be detected can extract different mixed feature subsets for matching according to different testing requirements.
For example, under the first test requirement, the main task is to check whether the software to be tested was produced by rewriting code from an open source software library through operations such as renaming variables and adding blank lines; the mixed feature subset then mainly includes the syntactic/semantic features at the code segment level. That is, in step S2 of this embodiment, the code segment-level symbol information of the software project to be detected, and the definition and use information of its variables in different contexts, are calculated in real time. In step S3 of this embodiment, the code segment-level symbol information and the definition and use information of variables in different contexts are taken from the mixed feature set of the code base software project being compared and combined into mixed feature subset 1 of that project; mixed feature subset 1 is then matched against the features extracted in step S2 to calculate the similarity between the two.
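One way to make a token-level comparison insensitive to variable renaming and added blank lines is sketched below. It is an assumption of this sketch, not a technique stated in the patent: identifiers are abstracted to a placeholder `ID`, and the two token sequences are compared with `difflib.SequenceMatcher` from the Python standard library. Python source is assumed, and the function names are hypothetical.

```python
import difflib
import io
import keyword
import tokenize

_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def abstract_tokens(code):
    """Token sequence with identifiers and constants replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")   # renaming a variable leaves the sequence unchanged
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)
    return out


def renaming_insensitive_similarity(code_a, code_b):
    """Similarity in [0, 1] between the abstracted token sequences."""
    matcher = difflib.SequenceMatcher(
        None, abstract_tokens(code_a), abstract_tokens(code_b), autojunk=False
    )
    return matcher.ratio()
```

Because indentation tokens are skipped, block structure is ignored; a stricter comparison would keep them or fall back to the function-level syntax tree.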
Under the second test requirement, the task is to check quickly whether the software to be tested reuses some files of the open source software library wholesale; the mixed feature subset then mainly includes the text/lexical features at the file level. That is, in step S2 of this embodiment, the file-level file statistics of the software project to be detected are calculated in real time. In step S3 of this embodiment, the file-level file statistics are taken from the mixed feature set of the code base software project being compared and combined into mixed feature subset 2 of that project; mixed feature subset 2 is then matched against the features extracted in step S2 to calculate the similarity between the two.
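The patent does not say how file statistics are compared. As one possible reading, the sketch below treats the file statistics of two files (for example, API call counts) as sparse count vectors and uses cosine similarity; both the measure and the function name `cosine_similarity` are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(stats_a, stats_b):
    """Cosine similarity between two mappings of statistic name -> count."""
    shared = stats_a.keys() & stats_b.keys()
    dot = sum(stats_a[key] * stats_b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in stats_a.values()))
    norm_b = math.sqrt(sum(value * value for value in stats_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty statistics vector matches nothing
    return dot / (norm_a * norm_b)


# Example: API call counts of a stored file and of a file under test.
stored = Counter({"print": 2, "os.path.join": 1})
candidate = Counter({"print": 2, "os.path.join": 1})
score = cosine_similarity(stored, candidate)
```

A file copied unchanged yields a score of 1 (up to floating-point rounding), so a high threshold on this score gives a fast first filter for wholesale file reuse.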
Although embodiments of the present invention have been described, various changes or modifications may be made by one of ordinary skill in the art within the scope of the appended claims.

Claims (2)

1. A multi-stage feature extraction method is characterized by comprising the following steps:
acquiring and storing a mixed feature set of each software project of a code base;
the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing syntax, semantics and texts of code segments in the software project;
after acquiring and storing the mixed feature set, the method further comprises:
acquiring one or more of folder level characteristics, file level characteristics, function level characteristics and code segment level characteristics of the software project to be detected according to the detection requirement of the software project to be detected;
performing feature matching on the acquired one or more features of the software item to be detected and corresponding features in the mixed feature set of the software items in the code base;
acquiring folder statistical information of each folder in the software project and associated information among files, functions and variables contained in each folder as folder-level characteristics of the corresponding software project;
the association information among the files, the functions and the variables contained in the folder comprises a file association graph and a function cross-file call graph;
acquiring file statistical information of files in the software project and association information between functions in the files as file-level characteristics of the corresponding software project;
the file statistical information comprises an API calling type, API calling times, a static variable type, static variable definition times and static variable use times;
the association information among the functions in the file comprises a function call relation graph;
acquiring function statistical information of each function in the software project, structured semantic information of codes in the function and structured syntactic information of the codes in the function as function-level characteristics of the corresponding software project;
the function statistical information comprises code structure statistical information and variable statistical information;
the structured semantic information of the code in the function comprises a code program dependency graph;
the structured syntax information of the code in the function comprises a code abstract syntax tree;
acquiring original text information, symbol information and definition and use information of variables in different contexts of each code segment in a software project;
the original text information of the code segment comprises character string information formed by standard preprocessing of the code segment;
the code segment symbol information includes a symbol sequence based on the code source file.
2. The multi-stage feature extraction method according to claim 1, characterized in that:
the folder statistical information comprises the number of files, the types of the files, the sizes of the files and the programming language of the folders.
CN201910857082.4A 2019-09-11 2019-09-11 Multistage feature extraction method Active CN110704308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910857082.4A CN110704308B (en) 2019-09-11 2019-09-11 Multistage feature extraction method


Publications (2)

Publication Number Publication Date
CN110704308A (en) 2020-01-17
CN110704308B (en) 2022-09-09

Family

ID=69195260


Country Status (1)

Country Link
CN (1) CN110704308B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant