CN111324380A

CN111324380A - Efficient multi-version cross-project software code clone detection method

Info

Publication number: CN111324380A
Application number: CN202010122695.6A
Authority: CN
Inventors: 吴毅坚; 方维康
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-06-23

Abstract

The invention belongs to the technical field of software code analysis, and particularly relates to an efficient multi-version cross-project software code clone detection method. The method first obtains the version information of each software project that contains multiple versions. Based on method names and file paths, it then groups the versions of the same method whose code content is identical or highly similar into a method version group, and selects the earliest version in each method version group as a sample method; the set of sample methods is called a history image. An index relation between each sample method and its method version group, called a method index, is established at the same time. Clone detection is then carried out on all the history images. Finally, the original full-scale clone relation is recovered from the clone detection result of the sample methods and the method index. Because the versions of a project share a large amount of repeated code, the invention shields this repeated code during code clone detection, thereby improving the efficiency of multi-version cross-project code clone detection.

Description

Efficient multi-version cross-project software code clone detection method

Technical Field

The invention belongs to the technical field of software development, and particularly relates to an efficient multi-version cross-project software code clone detection method.

Background

Code clones arise for various reasons. The main cause is that software developers copy and paste code, intentionally or unintentionally, usually with some slight modifications. Cloning speeds up development but also introduces many hidden risks. To understand the repeated code within a complex software project and its relationship to other software projects, and to improve the stability of the key code of the target software, code clone detection and automatic code analysis techniques are required. Code clone information also helps code reviewers and analysts locate the key code of the target software system and form a complete and comprehensive understanding of it, laying a data foundation for a deeper understanding of the software system and thereby improving its robustness and stability.

Researchers at home and abroad have produced many results on code clone detection, including but not limited to the following. Baker et al. proposed a text-based clone detection method that treats source code as text and compares it using the text line as the basic unit. CCFinder, by Kamiya et al., uses a lexical analyzer to divide each line of source code into tokens, transforms the token sequence, and finally performs clone detection on the transformed token sequence with a suffix-tree-based matching algorithm. Baxter et al. proposed an abstract-syntax-tree-based detection method that parses the source code into a syntax tree for clone analysis, thereby preserving more of the syntactic structure. Ferrante et al. proposed a program-dependence-graph-based clone detection method that converts source code into program dependence graphs and then detects clones using the similarity between the graphs.

Current academic clone detection techniques and tools focus mainly on the clone detection algorithm itself. For projects that contain multiple versions, however, the large amount of code repeated between versions means that, whichever clone detection tool is adopted, a large number of unnecessary repeated detections is inevitable. To detect clones more efficiently, the invention exploits the regularity of repeated code across versions and provides a method that is better suited to multi-version projects than traditional code clone detection methods and greatly improves the clone detection efficiency for such projects.

Disclosure of Invention

Because a multi-version project contains a large amount of repeated code, traditional clone detection methods perform many unnecessary repeated detections. To make up for this defect, the invention aims to provide an efficient code clone detection method for multi-version software projects based on the code mapping relation between versions.

The main idea of the method is as follows: for projects with multiple versions, heuristic rules are used to group the versions of the same method whose code content is identical or highly similar into method version groups, and one method from each method version group is then selected as a sample to participate in clone detection. Finally, the clone detection result of the sample methods is restored into a complete clone detection result.

The method comprises the following specific steps:

a. acquiring historical version information of each software project to be analyzed, wherein the historical version information comprises a version name and release time;

b. for each project: (1) establish method version groups: group the versions of the same method whose code content is identical or highly similar into a method version group; (2) build a history image: select the earliest version in each method version group as a sample method, the set of sample methods being called the history image of the project; (3) establish a method index: establish an index relation between each sample method and the method version group it belongs to, this index being called the method index. If the project has only one version, that version is the history image of the project;

c. performing clone detection on the history images of all the projects with a code clone detection tool to obtain a clone detection result;

d. restoring the original full-scale clone relation by combining the clone detection result obtained in step c with the method index stored in step b.

In step a, the projects to be detected are the set of projects specified by the user that need to undergo code clone detection. This step requires the user to provide version information for these projects.

The version information of the software to be analyzed comprises the names of all versions of the project and the corresponding release times, and is stored in a predetermined format. It can be exported directly from a version control tool such as SVN or Git; if the project is not managed by a version control tool, the version information can be added manually in the prescribed format.
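The acquisition of version information might be sketched as follows. The predetermined format is not specified in this description, so the tab-separated "name, ISO 8601 release time" layout, the `Version` record, and both function names are assumptions made for illustration; the Git variant further assumes that every release is marked by a tag.

```python
import subprocess
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Version:
    name: str            # version name, e.g. a tag such as "v1.0"
    released: datetime   # release time


def parse_version_lines(text: str) -> List[Version]:
    """Parse 'name<TAB>ISO 8601 time' lines (assumed format), earliest first."""
    versions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, stamp = line.split("\t")
        versions.append(Version(name, datetime.fromisoformat(stamp.strip())))
    return sorted(versions, key=lambda v: v.released)


def versions_from_git(repo_path: str) -> List[Version]:
    """Export version information from Git, assuming releases are tagged."""
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--format=%(refname:short)%09%(creatordate:iso-strict)", "refs/tags"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_version_lines(out)


# Manually written version information in the same assumed format.
sample = "v1.1\t2019-06-01T09:00:00+00:00\nv1.0\t2019-01-15T09:00:00+00:00\n"
ordered = parse_version_lines(sample)
print([v.name for v in ordered])
```

Sorting by release time at this point matters because the later grouping step processes the versions in release order.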

In step b, method version groups are established on the following basis: if a single project has multiple versions, a large number of identical methods are likely to exist across the versions, and these methods generally sit under the same relative path and have the same method name. Given these characteristics, establishing method version groups costs very little, while the gain in subsequent clone detection efficiency is considerable. The specific process handles the versions of the project one by one in order of release time: first, all methods in the current version are extracted; each method is then checked for membership in an existing method version group and skipped if it already belongs to one; otherwise, a new method version group is established, the method name and the relative path of the file containing the method are extracted, and all subsequent versions are searched for methods that have the same relative path and method name and highly similar text, which are added to the new method version group. Next, the earliest version in each method version group is selected as the sample method, and the set of all sample methods in a project is called the history image of the project. Finally, an index relation between the sample methods and the method version groups is established, and this index is called the method index.
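The grouping process described above can be illustrated with a minimal sketch. It assumes that the methods of each version have already been extracted into a dictionary keyed by (relative path, method name), and that this key is unique within a version; the type aliases, the function name `build_method_index`, and the `same` parameter are hypothetical. The default `same` (exact text equality) is only a placeholder for the similarity criterion.

```python
from typing import Callable, Dict, List, Tuple

MethodKey = Tuple[str, str]        # (relative file path, method name)
MethodId = Tuple[str, str, str]    # (version name, relative path, method name)
Snapshot = Dict[MethodKey, str]    # method key -> method source text


def build_method_index(
    versions: List[Tuple[str, Snapshot]],
    same: Callable[[str, str], bool] = lambda a, b: a == b,
) -> Dict[MethodId, List[MethodId]]:
    """Map each sample method to its method version group.

    `versions` must be sorted by release time, earliest first.
    `same` stands in for the text-similarity test (placeholder default).
    """
    grouped = set()                # every method already placed in a group
    index: Dict[MethodId, List[MethodId]] = {}
    for i, (vname, snapshot) in enumerate(versions):
        for (path, name), text in snapshot.items():
            sample = (vname, path, name)
            if sample in grouped:
                continue           # already covered by an earlier sample
            group = [sample]
            grouped.add(sample)
            # Search all subsequent versions for the same path and name.
            for later_name, later_snapshot in versions[i + 1:]:
                later_text = later_snapshot.get((path, name))
                member = (later_name, path, name)
                if (later_text is not None and member not in grouped
                        and same(text, later_text)):
                    group.append(member)
                    grouped.add(member)
            index[sample] = group  # the earliest member is the sample
    return index


versions = [
    ("v1", {("src/A.java", "foo"): "return 1;", ("src/A.java", "bar"): "return 2;"}),
    ("v2", {("src/A.java", "foo"): "return 1;", ("src/A.java", "bar"): "return 20;"}),
    ("v3", {("src/A.java", "foo"): "return 1;", ("src/A.java", "bar"): "return 20;"}),
]
method_index = build_method_index(versions)
history_image = list(method_index)   # the sample methods of the project
print(len(history_image), "sample methods out of 6 methods")
```

In this toy project `foo` never changes, so one sample represents all three versions, while `bar` changes in `v2` and therefore yields two samples.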

Whether two methods are the same method is judged by the edit distance between their texts. Specifically, methods A and B are considered the same method if the ratio of the edit distance between the texts of A and B to the length of the shorter of the two texts is less than 0.05, i.e., the text similarity between A and B exceeds 95%.
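A minimal sketch of this criterion follows. It assumes that the edit distance is the character-level Levenshtein distance, which this description does not state explicitly; the function names and the handling of empty method text are likewise assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling one-row table."""
    if len(a) < len(b):
        a, b = b, a                        # keep the row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def is_same_method(a: str, b: str, threshold: float = 0.05) -> bool:
    """True if edit distance / length of the shorter text is below the threshold."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return a == b                      # assumption: empty matches only empty
    return edit_distance(a, b) / shorter < threshold


original = "int add(int a, int b) { return a + b; }"
modified = original.replace("+", "-")      # a single-character change
print(is_same_method(original, modified))
print(is_same_method("return a;", "return b;"))
```

Because the ratio is taken against the shorter text, a single changed character is tolerated in a method of this length but not in a very short one.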

In step c, clone detection is performed on the history images of all the projects. The detection range covers clones within a project as well as clones between projects, and the detection result is a set of clone groups. The detection tool is configurable: either an existing detection tool or a self-developed one may be used.
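One way to keep the detection tool configurable is to treat it as a function from a set of methods to clone groups, as in the sketch below. The `CloneDetector` signature, `detect_on_history_images`, and `toy_detector` are hypothetical; the toy detector only matches methods whose text is identical after whitespace is collapsed and does not represent any real clone detection tool.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

MethodId = Tuple[str, str, str, str]   # (project, version, relative path, method name)
CloneDetector = Callable[[Dict[MethodId, str]], List[List[MethodId]]]


def toy_detector(methods: Dict[MethodId, str]) -> List[List[MethodId]]:
    """Stand-in detector: groups methods with equal whitespace-normalized text."""
    buckets = defaultdict(list)
    for method_id, text in methods.items():
        buckets[re.sub(r"\s+", " ", text).strip()].append(method_id)
    return [group for group in buckets.values() if len(group) > 1]


def detect_on_history_images(
    history_images: Dict[str, Dict[MethodId, str]],
    detector: CloneDetector,
) -> List[List[MethodId]]:
    """Pool the sample methods of all projects and run the chosen detector once."""
    pooled: Dict[MethodId, str] = {}
    for image in history_images.values():
        pooled.update(image)
    return detector(pooled)


history_images = {
    "proj-a": {("proj-a", "v1", "src/U.java", "trim"): "return s.trim();"},
    "proj-b": {
        ("proj-b", "v1", "lib/S.java", "strip"): "return  s.trim();",
        ("proj-b", "v1", "lib/S.java", "len"): "return s.length();",
    },
}
clone_groups = detect_on_history_images(history_images, toy_detector)
print(clone_groups)
```

Pooling the history images before detection lets a single run report clones within a project and clones between projects together.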

In step d, restoring the original full-scale clone relation means mapping the partial clone relation to the complete clone relation, using the clone detection result of the history images of the projects together with the method index.

Here, the full-scale clone relation is the result that a clone detection tool would produce if the multi-version projects were given no additional processing.
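The restoration step might look like the following sketch, in which each sample method in a detected clone group is replaced by all members of its method version group. The function name and the data layout are assumptions. The sketch only expands the groups reported by the detector; how a method version group whose sample has no clone should be reported is not specified in this description and is left out.

```python
from typing import Dict, List, Tuple

MethodId = Tuple[str, str, str]    # (version name, relative path, method name)


def restore_full_clones(
    sample_groups: List[List[MethodId]],
    method_index: Dict[MethodId, List[MethodId]],
) -> List[List[MethodId]]:
    """Expand clone groups of sample methods into groups over all versions."""
    full_groups = []
    for group in sample_groups:
        expanded: List[MethodId] = []
        for sample in group:
            # A sample missing from the index stands only for itself.
            expanded.extend(method_index.get(sample, [sample]))
        full_groups.append(expanded)
    return full_groups


method_index = {
    ("v1", "src/A.java", "foo"): [("v1", "src/A.java", "foo"), ("v2", "src/A.java", "foo")],
    ("v1", "src/B.java", "bar"): [("v1", "src/B.java", "bar")],
}
sample_groups = [[("v1", "src/A.java", "foo"), ("v1", "src/B.java", "bar")]]
full_groups = restore_full_clones(sample_groups, method_index)
print(full_groups)
```

The expansion is a plain lookup per sample, so its cost grows with the number of restored clone instances and not with the amount of source code.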

Compared with the prior art, the invention has the following advantages and positive effects: it gives software maintainers and developers an effective means of understanding the clone relations between a multi-version software system and other projects. Unlike traditional code clone detection techniques and tools, which focus mainly on the detection algorithm, the invention starts from the structural characteristics of multi-version projects and reduces the amount of code that has to be detected, so that the efficiency of multi-version cross-project code clone detection can be greatly improved.

Drawings

FIG. 1 is a schematic diagram of the basic process of the present invention, which comprises extracting project version information; establishing method version groups, the history image, and the method index; performing clone detection; and restoring the clone relation.

FIG. 2 is a diagram of an exemplary implementation, showing the specific process of performing clone detection on multiple release versions of a set of software projects.

Detailed Description

Further objects, specific structural features and advantages of the present invention will be understood from the following description of embodiments of the present invention, taken in conjunction with the accompanying drawings. Fig. 2 is a schematic diagram of an exemplary implementation.

In this embodiment, 251 open-source Java projects from different domains were selected from GitHub as the code to be analyzed; each has more than 50 stars and at least two release versions. The following is a specific implementation example of multi-version cross-project code clone detection for this set of software projects.

The main process based on this embodiment is:

(1) parsing the set of target projects to be detected and their version information according to the user-specified project set path, version information file path, and so on; using the version control tool Git, extracting and storing the source code of all release versions and saving all version information to a database; since this embodiment targets only Java code, only the Java code files (.java files) in the source code are retained; there are 3234 release versions in total, with about 300 million lines of code;

(2) building the history images and establishing the method index; the method described above is used to build the history images of the 3234 selected release versions and to store the method index, which takes 1129 seconds (to ensure a credible result, the value is the average over several runs; the same applies below); the generated history images of the 251 projects total about 4 million lines of code and 788120 sample methods;

(3) clone detection; an existing code clone detection tool is used to perform cross-project clone detection on the history images generated in step (2), yielding 644653 clone instances in 82595 clone groups and taking 96 seconds;

(4) restoring the full-scale clone relation; the full-scale clone relation is restored from the detected clone groups and the established method index, giving 3821507 clone instances in total.

Analysis of results: the history images generated in step (2) total about 4 million lines of code, which is about 87% less than the code of all the original release versions (300 million lines). Clone detection on the history images in step (3) takes 96 seconds. For comparison, the same code clone detection tool was also run on the full code set (the 3234 release versions with the original 300 million lines of code), which took about 5800 seconds. The method therefore greatly shortens the overall clone detection time and improves the efficiency of multi-version cross-project code clone detection.
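The timing comparison can be made explicit with a short calculation over the figures reported in this embodiment. The time of the restoration step (4) is not reported, so it is excluded here, and the resulting speedup should be read as an approximate upper estimate; the variable names are illustrative.

```python
# Times reported in the embodiment, in seconds.
build_history_images = 1129      # step (2): history images and method index
detect_on_images = 96            # step (3): clone detection on the history images
detect_on_all_versions = 5800    # same tool run on all 3234 release versions

# Step (4), restoration, has no reported time and is therefore left out.
proposed_total = build_history_images + detect_on_images
speedup = detect_on_all_versions / proposed_total
time_saved = 1 - proposed_total / detect_on_all_versions

print(f"proposed: {proposed_total} s, speedup: {speedup:.2f}x, saved: {time_saved:.1%}")
```

Under these figures the two reported steps together take 1225 seconds, roughly 4.7 times less than detection on all release versions.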

Claims

1. An efficient multi-version cross-project software code clone detection method is characterized by comprising the following specific steps:

a. acquiring historical version information of each software project to be analyzed;

b. for each project, method version groups are first established: the versions of the same method whose code content is identical or highly similar are grouped into a method version group; then, a history image is built: the earliest version in each method version group is selected as a sample method, the set of sample methods being called the history image of the project; and finally a method index is established: an index relation between each sample method and the method version group it belongs to is established, this index being called the method index;

c. performing clone detection on the history images of all the projects with a code clone detection tool to obtain a clone detection result;

d. restoring the original full-scale clone relation by combining the clone detection result obtained in step c with the method index stored in step b.

2. The method according to claim 1, wherein in step a, the version information of the software to be analyzed includes names of all versions of the project and corresponding release times; the version information is stored in a predetermined format.

3. The method according to claim 1, wherein in step b, the specific process of establishing the method version groups is to process the versions of the project one by one in order of release time as follows: first, all methods in the current version are extracted; each method is then checked for membership in an existing method version group and skipped if it already belongs to one; otherwise, a new method version group is established, the method name and the relative path of the file containing the method are extracted, and all subsequent versions are searched for methods that have the same relative path and method name and highly similar text, which are added to the new method version group; then, the earliest version in each method version group is selected as the sample method, the set of all sample methods in a project being called the history image of the project; finally, an index relation between the sample methods and the method version groups is established, this index being called the method index.

4. The method according to claim 3, wherein in step b, the criterion for the same method is the edit distance between method texts, specifically: for method A and method B, if the ratio of the edit distance between the texts of method A and method B to the length of the shorter of the two texts is less than 0.05, i.e., the text similarity between method A and method B exceeds 95%, then method A and method B are considered to be the same method.

5. The method according to claim 3, wherein in step c, clone detection is performed on the history images of all the projects, the detection range covers clones within a project and clones between projects, and the detection result is a set of clone groups; the detection tool is configurable.

6. The method according to any one of claims 1 to 5, wherein in step d, restoring the original full-scale clone relation means mapping the partial clone relation to the complete clone relation, using the clone detection result of the history images of the projects together with the method index;

wherein the full-scale clone relation is the result that a clone detection tool would produce if the multi-version projects were given no additional processing.