CN111324380A - Efficient multi-version cross-project software code clone detection method - Google Patents

Publication number
CN111324380A
Authority
CN
China
Prior art keywords
version
clone
project
index
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010122695.6A
Other languages
Chinese (zh)
Inventor
吴毅坚
方维康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010122695.6A
Publication of CN111324380A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Abstract

The invention belongs to the technical field of software code analysis and relates in particular to an efficient multi-version, cross-project software code clone detection method. The method first obtains the version information of each software project that has multiple versions. Using method names and file paths, it then places occurrences of the same method in different versions, whose code content is identical or highly similar, into a method version group. The earliest version in each method version group is selected as the sample method; the set of sample methods is called the history image. An index relation between each sample method and its method version group, called the method index, is established at the same time. Clone detection is then performed on all history images. Finally, the original full clone relation is restored from the clone detection result of the sample methods and the method index. Because the versions of a project share a large amount of repeated code, the invention masks that repeated code during clone detection and thereby improves the efficiency of multi-version cross-project code clone detection.

Description

Efficient multi-version cross-project software code clone detection method
Technical Field
The invention belongs to the technical field of software development, and particularly relates to an efficient multi-version cross-project software code clone detection method.
Background
Clone code arises for many reasons. The main one is that developers copy and paste code, intentionally or not, usually with slight modifications. Cloning speeds up development but also introduces many hidden risks. Understanding the repeated code inside a complex software project and its relationship to other projects, and improving the stability of the target software's key code, requires code clone detection and automated code analysis. Clone information also helps code reviewers and analysts locate the key code of the target system and form a complete understanding of it, which lays a data foundation for understanding the software system in depth and thereby improving its robustness and stability.
Researchers in China and abroad have produced many results on code clone detection, including but not limited to the following. Baker et al. proposed a text-based method that treats source code as text and compares it using the text line as the basic unit. CCFinder, by Kamiya et al., uses a lexical analyzer to split each line of source code into tokens, transforms the token sequence, and then detects clones on the transformed sequence with a suffix-tree-based matching algorithm. Baxter et al. proposed an abstract-syntax-tree method that parses source code into a syntax tree for clone analysis, which preserves more of the syntactic structure. Ferrante et al. proposed a method based on the program dependence graph that converts source code into program dependence graphs and detects clones from the similarity between graphs.
Current academic clone detection techniques and tools focus mainly on the detection algorithm itself. For projects that contain multiple versions, however, the large amount of code repeated between versions means that any clone detection tool inevitably performs a great deal of unnecessary repeated detection. To detect clones more efficiently, the invention exploits the regularity of code repeated between versions and provides a method better suited to multi-version projects than traditional code clone detection, greatly improving clone detection efficiency for such projects.
Disclosure of Invention
Traditional clone detection methods perform many unnecessary repeated detections because a multi-version project contains a large amount of repeated code. To remedy this shortcoming, the invention aims to provide an efficient code clone detection method for multi-version software projects based on the code mapping relation between versions.
The method mainly works as follows. For projects with multiple versions, heuristic rules are used to place occurrences of the same method in different versions, whose code content is identical or highly similar, into method version groups; one method from each group is then selected as a sample to take part in clone detection. Finally, the clone detection result of the sample methods is restored into the complete clone detection result.
The method comprises the following specific steps:
a. acquiring the historical version information of each software project to be analyzed, which comprises version names and release times;
b. for each project: (1) establishing method version groups: a method version group is established for occurrences of the same method in different versions whose code content is identical or highly similar; (2) building the history image: the earliest version in each method version group is selected as the sample method, and the set of sample methods is called the history image of the project; (3) establishing the method index: an index relation is established between each sample method and the method version group it belongs to, and this index is called the method index. If a project has only one version, that version is the history image of the project;
c. performing clone detection on the history images of all projects with a code clone detection tool to obtain a clone detection result;
d. restoring the original full clone relation by combining the obtained clone detection result with the method index stored in step b.
In step a, the projects to be detected are the set of projects that the user designates for code clone detection. The user must provide version information for these projects.
The version information of the software to be analyzed comprises the names of all versions of a project and their release times, stored in a prescribed format. It can be exported directly from a version control tool such as SVN or Git; if the project is not managed by a version control tool, it can be entered manually in the prescribed format.
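The text does not specify the storage format for version information. As a minimal sketch, assume one record per line of the form `version name,release time` with an ISO 8601 time; the `Version` class, the `parse_versions` function, and the record layout are all hypothetical choices made for illustration. The records are returned oldest first, which is the order step b needs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Version:
    name: str            # version name, e.g. a release tag
    released: datetime   # release time


def parse_versions(lines):
    """Parse 'name,ISO-8601 time' records (an assumed format), oldest first."""
    versions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, released = (field.strip() for field in line.split(",", 1))
        versions.append(Version(name, datetime.fromisoformat(released)))
    # Step b processes versions in release-time order.
    return sorted(versions, key=lambda v: v.released)


records = [
    "# version name,release time",
    "v1.1,2019-06-01",
    "v1.0,2019-01-15",
]
ordered = parse_versions(records)
```

The same records could be exported from a version control tool or typed by hand; only the two fields named in the text are assumed.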
In step b, method version groups are established per project. If a project has multiple versions, a large number of identical methods is likely to exist across them, and such methods generally sit under the same relative path and have the same method name. Because of these characteristics, establishing method version groups costs very little, while the gain in subsequent clone detection efficiency is considerable. Each version of the project is processed in order of release time as follows. First, all methods in the current version are extracted. For each method, it is determined whether the method already belongs to a method version group; if so, it is skipped. Otherwise, a new method version group is established, the method name and the relative path of the file containing the method are extracted, and all subsequent versions are searched for methods with the same relative path and method name and highly similar text, which are added to the new group. Next, the earliest version in each method version group is selected as the sample method; the set of all sample methods in a project is called the history image of the project. Finally, an index relation between each sample method and its method version group is established; this index is called the method index.
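The grouping procedure of step b can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each version is modeled as a dictionary from (relative path, method name) to method text, later versions are compared against the text of the earliest member, and the default `is_same` predicate is plain equality standing in for the similarity criterion. The function name `build_history_image` and the data layout are hypothetical, and overloaded methods that share a name are not handled.

```python
def build_history_image(versions, is_same=lambda a, b: a == b):
    """Build the history image and method index for one project.

    versions: list of (version_name, {(path, method_name): text}),
              ordered by release time, oldest first.
    is_same:  predicate deciding whether two method texts are the same
              method (plain equality here, as a stand-in).
    """
    grouped = set()      # (version, path, name) ids already in a group
    history_image = {}   # sample id -> sample method text
    method_index = {}    # sample id -> ids of all members of its group
    for i, (version, methods) in enumerate(versions):
        for (path, name), text in methods.items():
            sample_id = (version, path, name)
            if sample_id in grouped:
                continue  # already belongs to a method version group
            members = [sample_id]
            grouped.add(sample_id)
            # Search all subsequent versions for the same path and name.
            for later_version, later_methods in versions[i + 1:]:
                later_id = (later_version, path, name)
                later_text = later_methods.get((path, name))
                if (later_text is not None and later_id not in grouped
                        and is_same(text, later_text)):
                    members.append(later_id)
                    grouped.add(later_id)
            # The group creator is the earliest version: the sample method.
            history_image[sample_id] = text
            method_index[sample_id] = members
    return history_image, method_index


versions = [
    ("v1", {("A.java", "foo"): "return 1;", ("A.java", "bar"): "return 2;"}),
    ("v2", {("A.java", "foo"): "return 1;", ("A.java", "bar"): "return 20;"}),
    ("v3", {("A.java", "foo"): "return 1;", ("A.java", "bar"): "return 20;"}),
]
history_image, method_index = build_history_image(versions)
```

In this toy project `foo` never changes, so one sample covers three versions, while `bar` changes in v2 and therefore yields two samples. Only the three samples would be handed to the clone detector instead of all six method occurrences.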
Whether two methods are the same is decided by the edit distance between their texts. Specifically, methods A and B are considered the same method if the ratio of the edit distance between their texts to the length of the shorter text is less than 0.05, that is, if the text similarity between methods A and B exceeds 95%.
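The criterion above can be sketched with a standard Levenshtein distance. The text does not say which edit distance variant or which text normalization is used, so the character-level distance with unit costs, the function names, and the handling of empty texts are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_same_method(text_a: str, text_b: str, threshold: float = 0.05) -> bool:
    """True if edit distance / shorter text length is below the threshold."""
    shorter = min(len(text_a), len(text_b))
    if shorter == 0:
        # Assumption: an empty text only matches another empty text.
        return text_a == text_b
    return edit_distance(text_a, text_b) / shorter < threshold
```

Because the ratio uses the shorter text as the denominator, a short method tolerates very few edits: a 39-character method with one changed character gives 1/39, about 0.026, and still counts as the same method, whereas two changed characters give about 0.051 and do not.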
In step c, clone detection is run on the history images of all projects. The detection scope covers clones within a project as well as clones between projects, and the detection result is a set of clone groups. The detection tool is configurable: it can be an existing detection tool or a self-developed one.
In step d, restoring the original full clone relation means mapping the partial clone relation, obtained as the clone detection result for the history images of the projects, to the complete clone relation by means of the method index.
Here, the full clone relation is the result that a clone detection tool would produce if the multi-version projects were given no additional processing.
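Step d can be sketched as an expansion of each detected clone group through the method index. The function name and the data shapes (clone groups as lists of sample ids, the method index as a dictionary from a sample id to the members of its method version group) are assumptions. The text also does not state whether a method version group whose sample matched no other sample should itself be reported as a clone group; this sketch does not report it.

```python
def restore_full_clones(sample_clone_groups, method_index):
    """Expand clone groups over sample methods into full clone groups.

    sample_clone_groups: list of clone groups, each a list of sample ids.
    method_index: sample id -> ids of all methods in its version group.
    """
    full_groups = []
    for group in sample_clone_groups:
        expanded = []
        for sample_id in group:
            # Every member of the sample's version group joins the group;
            # a sample missing from the index stands for itself only.
            expanded.extend(method_index.get(sample_id, [sample_id]))
        full_groups.append(expanded)
    return full_groups


method_index = {
    "p1/foo@v1": ["p1/foo@v1", "p1/foo@v2", "p1/foo@v3"],
    "p2/copy@v1": ["p2/copy@v1", "p2/copy@v2"],
}
detected = [["p1/foo@v1", "p2/copy@v1"]]
full = restore_full_clones(detected, method_index)
```

One detected pair of samples expands to a group of five clone instances here, which is how a small detection result over the history images can stand for a much larger full clone relation.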
Compared with the prior art, the invention has the following advantages and positive effects. It gives software maintainers and developers an effective means of understanding the clone relations between a multi-version software system and other projects. Unlike traditional code clone detection techniques and tools, which focus mainly on the algorithm, the invention starts from the structural characteristics of multi-version projects and reduces the amount of code that has to be examined, which can greatly improve the efficiency of multi-version cross-project code clone detection.
Drawings
FIG. 1 is a schematic diagram of the basic process of the invention: extracting project version information; establishing the method version groups, history image, and method index; performing clone detection; and restoring clones.
FIG. 2 is a diagram of an exemplary implementation, showing the specific process of clone detection over multiple release versions of a collection of software projects.
Detailed Description
Further objects, specific structural features and advantages of the present invention will be understood from the following description of embodiments of the present invention, taken in conjunction with the accompanying drawings. Fig. 2 is a schematic diagram of an exemplary implementation.
In this embodiment, 251 open-source Java projects from different domains, selected from GitHub, each with more than 50 stars and at least two release versions, serve as the code on which clone detection is performed. The following is a specific implementation example of multi-version cross-project code clone detection for this collection of software projects.
The main process based on this embodiment is:
(1) the target project set and its version information are parsed from the user-specified project set path, version information file path, and so on; the source code of all release versions is extracted from the version control tool Git and stored, and all version information from Git is saved to a database; because this embodiment targets Java code only, only the Java source files (.java files) are retained; there are 3234 release versions in total, with about 300 million lines of code;
(2) building the history images and establishing the method index: the method described above is used to build the history images of the 3234 selected release versions and to store the method index, which takes 1129 seconds (to make the result reliable, the value is an average over several runs; the same applies below); the history images of the 251 projects total about 4 million lines of code and 788120 sample methods;
(3) clone detection: an existing code clone detection tool is run across projects on the history images generated in step (2), yielding 644653 clone instances in 82595 clone groups in 96 seconds;
(4) restoring the full clone relation: the full clone relation is restored from the detected clone groups and the established method index, giving 3821507 clone instances in total.
Result analysis: the history images generated in step (2) total about 4 million lines of code, a reduction of about 87% compared with the original code of all release versions (300 million lines). Clone detection on the history images in step (3) takes 96 seconds. For comparison, the same clone detection tool was also run on the full code set (the 3234 release versions, 300 million lines of original code), which took about 5800 seconds. The method therefore greatly shortens the overall clone detection time and improves the efficiency of multi-version cross-project code clone detection.
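The reported timings can be combined into a rough end-to-end comparison. The arithmetic below uses only figures stated in the embodiment; the time for step (4), restoring the full clone relation, is not reported, so the total for the proposed method covers steps (2) and (3) only and the resulting speedup is an upper estimate. The variable names are illustrative.

```python
# Figures reported in the embodiment.
build_seconds = 1129       # step (2): history images and method index
detect_seconds = 96        # step (3): clone detection on history images
baseline_seconds = 5800    # same tool run on all 3234 release versions

sample_instances = 644653  # clone instances found on the history images
full_instances = 3821507   # clone instances after restoring step (4)

# Step (4) time is not reported, so it is left out of the total.
proposed_seconds = build_seconds + detect_seconds
speedup = baseline_seconds / proposed_seconds
expansion = full_instances / sample_instances

print(f"proposed pipeline: {proposed_seconds} s")
print(f"speedup over baseline: {speedup:.2f}x")
print(f"instances per detected sample instance: {expansion:.2f}")
```

Under these assumptions the two measured steps take 1225 seconds against about 5800 seconds for the baseline, and each clone instance detected on the history images corresponds to roughly six instances in the restored full clone relation.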

Claims (6)

1. An efficient multi-version cross-project software code clone detection method is characterized by comprising the following specific steps:
a. acquiring historical version information of each software project to be analyzed;
b. for each project, first establishing method version groups: a method version group is established for occurrences of the same method in different versions whose code content is identical or highly similar; then building a history image: the earliest version in each method version group is selected as a sample method, and the set of sample methods is called the history image of the project; and finally establishing a method index: an index relation is established between each sample method and the method version group it belongs to, and this index is called the method index;
c. performing clone detection on the history images of all projects with a code clone detection tool to obtain a clone detection result;
d. restoring the original full clone relation by combining the obtained clone detection result with the method index stored in step b.
2. The method according to claim 1, wherein in step a, the version information of the software to be analyzed includes names of all versions of the project and corresponding release times; the version information is stored in a predetermined format.
3. The method according to claim 1, wherein in step b, the method version groups are established by processing each version of the project in order of release time as follows: first, all methods in the current version are extracted, and for each method it is determined whether the method already belongs to a method version group; if so, the method is skipped; otherwise, a new method version group is established, the method name and the relative path of the file containing the method are extracted, and all subsequent versions are searched for methods with the same relative path and method name and highly similar text, which are added to the new method version group; then, the earliest version in each method version group is selected as a sample method, and the set of all sample methods in a project is called the history image of the project; finally, an index relation between each sample method and its method version group is established, and this index is called the method index.
4. The method according to claim 3, wherein in step b, the criterion for the same method is the edit distance between method texts, specifically: for method A and method B, if the ratio of the edit distance between the texts of method A and method B to the smaller of the text lengths of method A and method B is less than 0.05, i.e., the text similarity between method A and method B exceeds 95%, then method A and method B are considered to be the same method.
5. The method according to claim 3, wherein in step c, the clone detection is performed on the history image of each project, the detection scope includes clones within a project and clones between projects, and the detection result is a set of clone groups; and the detection tool is configurable.
6. The method according to any one of claims 1 to 5, wherein in step d, restoring the original full clone relation means mapping from the partial clone relation to the complete clone relation, based on the clone detection result for the history images of the projects in combination with the method index;
wherein the full clone relation is the result that a clone detection tool would produce if the multi-version projects were given no additional processing.
CN202010122695.6A 2020-02-27 2020-02-27 Efficient multi-version cross-project software code clone detection method Pending CN111324380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122695.6A CN111324380A (en) 2020-02-27 2020-02-27 Efficient multi-version cross-project software code clone detection method

Publications (1)

Publication Number Publication Date
CN111324380A true CN111324380A (en) 2020-06-23

Family

ID=71169176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122695.6A Pending CN111324380A (en) 2020-02-27 2020-02-27 Efficient multi-version cross-project software code clone detection method

Country Status (1)

Country Link
CN (1) CN111324380A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681835A (en) * 2010-12-20 2012-09-19 微软公司 Code clone notification and architectural change visualization
CN106919403A (en) * 2017-03-16 2017-07-04 杭州承方信息科技有限公司 Many granularity Code Clones detection methods based on Java bytecode under cloud environment
US10402310B1 (en) * 2018-03-30 2019-09-03 Atlassian Pty Ltd Systems and methods for reducing storage required for code coverage results
CN110399162A (en) * 2019-07-09 2019-11-01 北京航空航天大学 A kind of source code annotation automatic generation method
CN110442847A (en) * 2019-07-26 2019-11-12 南京邮电大学 Code similarity detection method and device based on code storage process management

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李锁等: ""基于代码克隆检测的代码来源分析方法"", 《计算机应用与软件》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200623