CN106919403B - multi-granularity code clone detection method based on Java byte codes in cloud environment - Google Patents

Info

Publication number
CN106919403B
CN106919403B CN201710156441.4A CN201710156441A CN106919403B CN 106919403 B CN106919403 B CN 106919403B CN 201710156441 A CN201710156441 A CN 201710156441A CN 106919403 B CN106919403 B CN 106919403B
Authority
CN
China
Prior art keywords
code
similarity
instruction
java
clone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710156441.4A
Other languages
Chinese (zh)
Other versions
CN106919403A (en)
Inventor
俞东进
陈耀旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Lujie Technology Co., Ltd.
Original Assignee
Hangzhou Lujie Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Lujie Technology Co Ltd
Priority to CN201710156441.4A
Publication of CN106919403A
Application granted
Publication of CN106919403B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a multi-granularity code clone detection method based on Java bytecode in a cloud environment. By analyzing Java bytecode instructions, the invention extracts code at block granularity, so that clones at method granularity and clones at block granularity can be detected at the same time. When code similarity is calculated, both the similarity between instructions and the similarity between method calls are taken into account, so that semantic clones can be detected more reliably. Compared with traditional code clone detection methods based on Java bytecode, the method detects clones at both method granularity and block granularity, and the added comparison of method-call similarity makes the detection results more accurate.

Description

Multi-granularity code clone detection method based on Java byte codes in cloud environment
Technical Field
The invention belongs to the technical field of code clone detection in software analysis, and particularly relates to a multi-granularity code clone detection method based on Java bytecode in a cloud environment.
Background
The emergence of cloud computing provides a new collaborative working mode for traditional software development. In this mode, software teams and individuals located in different places jointly develop the same software system. Collaborative development in a cloud environment is convenient for developers but makes software management harder. In particular, because of differences between regions and teams, there is no effective supervision of developers who reuse code by copying and pasting it, with or without small modifications. Such duplicated code (code clones) can seriously hinder software maintenance. Detecting code clones in the software source code reveals unnecessary duplicated code and thereby supports subsequent code refactoring.
Existing code clone detection techniques fall mainly into several categories: text-based, token-based, Abstract Syntax Tree (AST)-based, Program Dependency Graph (PDG)-based, and metric-based. Text-based methods do not need to convert the source code and have high precision but low recall. Token-based methods are fast and independent of the development language, but have difficulty detecting Type-3 code clones. AST-based and PDG-based methods must convert the source code into an AST or a PDG before comparison, so both are relatively expensive. Because different code segments may yield the same metric values, metric-based methods have a high false-positive rate.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multi-granularity code clone detection method based on Java byte codes in a cloud environment.
The method comprises the following specific steps:
Step (1): Java source code distributed in different places in a cloud environment is obtained and compiled into class files, and each class file is converted into a txt format file through a Java command; each txt format file contains one or more Java methods, each method consisting of a series of bytecode instructions and associated method calls;
Step (2): the methods and the code blocks in the txt format files obtained in step (1) are extracted. A method is extracted by directly reading the bytecode instructions corresponding to that method; extracting a code block requires analysis of control transfer instructions, such as the bytecode instructions related to goto, switch, and if;
Step (3): a classification hierarchy of bytecode instructions is constructed and encoded according to the classification of the Java bytecode instructions. The hierarchy is mainly divided into three layers: the first layer is the nine major classes, except Type, of the Java bytecode instructions; the second layer is a detailed subclass division of each major class; and the third layer is the corresponding bytecode instructions;
Step (4): features are extracted from the code segments (methods or code blocks) of the two granularities (method granularity and code block granularity) extracted in step (2). The main extracted features comprise the instruction sequence $IS = (I_1, I_2, I_3, \ldots, I_i, \ldots, I_k)$ and the method call sequence $MCS = \{M_1, M_2, M_3, \ldots, M_i, \ldots, M_r\}$, where k and r are natural numbers.
After the features are extracted, the classification hierarchy of bytecode instructions is used to normalize the instruction sequence, converting it into a primary instruction sequence and a secondary instruction sequence. The primary instruction sequence corresponds to the first layer (the major classes of Java bytecode instructions) of the classification hierarchy, and the secondary instruction sequence corresponds to the second layer (the subclasses of Java bytecode instructions) of the classification hierarchy;
Step (5): for two code segments (methods or code blocks), Type-1 clone detection, Type-2 clone detection and Type-3 clone detection are carried out using the instruction sequences and the method call sequences obtained in step (4).
In Type-1 and Type-2 clone detection, hash values of the secondary instruction sequences of the two code segments are calculated first. If the hash values are equal, the method call sequences of the two code segments are compared; if the number of methods in the method call sequences is the same and the number of parameters of each method is also the same, the two code segments are a code clone instance.
In Type-3 clone detection, the primary instruction sequence similarity, the secondary instruction sequence similarity and the method call sequence similarity of the two code segments are calculated and weighted to determine the final similarity value of the code segments. If this value exceeds a preset threshold, the two code segments are judged to be a code clone instance. The primary and secondary instruction sequence similarities are calculated using the edit distance. The similarity of two method call sequences is calculated recursively, by converting it into the similarity of the code segments corresponding to each method in the two method call sequences.
The multi-granularity code clone detection method based on Java bytecode in a cloud environment provided by the invention is implemented by a group of functional modules: a preprocessing module, a feature extraction module and a code clone detection module.
The preprocessing module compiles Java source code distributed in different places in the cloud environment into class files and then converts the class files into bytecode files in txt format. Code segments at method granularity and block granularity are then extracted from the bytecode files.
The feature extraction module extracts the instruction sequence and the method call sequence of each code segment and normalizes the extracted instruction sequence.
The code clone detection module uses different methods to detect Type-1, Type-2 and Type-3 clones. Type-1 and Type-2 clones are judged mainly by comparing the hash values of the instruction sequences and the number of parameters of each method call; Type-3 clones are judged mainly by comparing the similarity of the instruction sequences and the similarity of the method call sequences.
In the multi-granularity code clone detection method based on Java bytecode in a cloud environment, code at block granularity is extracted by analyzing Java bytecode instructions, so that clones at method granularity and clones at block granularity can be detected at the same time. When code similarity is calculated, both the similarity between instructions and the similarity between method calls are taken into account, so that semantic clones can be detected more reliably. Compared with traditional code clone detection methods based on Java bytecode, the method detects clones at both method granularity and block granularity, and the added comparison of method-call similarity makes the detection results more accurate.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a diagram of an example of code block extraction;
FIG. 3 is a block diagram of a byte code classification hierarchy;
FIG. 4 is a feature extraction graph;
FIG. 5 is a flow chart of clone detection.
Detailed Description
The specific implementation of the multi-granularity code clone detection method based on Java bytecode in a cloud environment mainly comprises 3 steps (as shown in FIG. 1):
1) Preprocessing
The preprocessing stage mainly comprises the following two steps:
(1) Source code compilation
First, Java source code distributed in different places in the cloud environment is compiled into class files, and the class files are converted into Java bytecode files in txt format, which serve as the input of the subsequent clone detection. Each Java bytecode file contains one or more methods, each of which is made up of a series of instructions and associated method calls. Blank lines and comments are handled automatically during Java compilation, and the influence of variable renaming is removed, so blank lines, comments and variable names need no separate preprocessing.
(2) Method extraction and code block extraction
Because a Java bytecode file lists each Java method as a separate section of bytecode instructions, extracting a code segment at method granularity only requires reading the bytecode instructions of each method directly. Extracting a code segment at block granularity requires analyzing the control transfer instructions of the Java bytecode. The analysis proceeds as follows (taking FIG. 2 as an example):
a) Read the bytecode instructions of the code segment. When a goto instruction is encountered, read its current number and its jump number, and add the instructions between the number following the current number and the jump number to gotoSet.
b) When an if instruction is encountered and its current number is greater than its jump number, remove from gotoSet the instruction whose current number equals the jump number, and add the instructions between the current number and the number preceding the jump number to ifReverseSet; otherwise, add the instructions to ifSet.
c) When a switch branch instruction (tableswitch or lookupswitch) is encountered, extract the corresponding jump numbers and add them to numArr, then traverse the numbers in numArr in ascending order and extract the bytecode instructions corresponding to each number. If a switch is encountered while a block is being extracted, step c) is invoked recursively.
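Step a) can be illustrated with a minimal Java sketch. It is not the patented implementation: the input line format ("offset: opcode target"), the class and method names, the choice to exclude the jump target itself from gotoSet, and the decision to skip backward gotos are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical instruction record: number (offset), opcode name, jump number (-1 if none).
final class SketchInsn {
    final int offset;
    final String opcode;
    final int target;

    SketchInsn(int offset, String opcode, int target) {
        this.offset = offset;
        this.opcode = opcode;
        this.target = target;
    }
}

final class GotoBlockSketch {
    // Parses a line assumed to look like "6: goto 11" or "4: iconst_1".
    static SketchInsn parse(String line) {
        String[] parts = line.trim().split("[:\\s]+");
        String opcode = parts[1];
        boolean jumps = opcode.equals("goto") || opcode.startsWith("if");
        int target = (jumps && parts.length > 2) ? Integer.parseInt(parts[2]) : -1;
        return new SketchInsn(Integer.parseInt(parts[0]), opcode, target);
    }

    // For every forward goto, collects the instructions that lie strictly between
    // the goto and its jump target (assumption: the target itself is excluded).
    static List<List<SketchInsn>> gotoSets(List<SketchInsn> method) {
        List<List<SketchInsn>> result = new ArrayList<>();
        for (SketchInsn cur : method) {
            if (!cur.opcode.equals("goto") || cur.target <= cur.offset) {
                continue; // not a goto, or a backward goto (ignored in this sketch)
            }
            List<SketchInsn> gotoSet = new ArrayList<>();
            for (SketchInsn other : method) {
                if (other.offset > cur.offset && other.offset < cur.target) {
                    gotoSet.add(other);
                }
            }
            result.add(gotoSet);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "0: iload_1", "1: ifeq 9", "4: iconst_1", "5: istore_2",
            "6: goto 11", "9: iconst_2", "10: istore_2", "11: return"
        };
        List<SketchInsn> method = new ArrayList<>();
        for (String line : lines) {
            method.add(parse(line));
        }
        for (List<SketchInsn> set : gotoSets(method)) {
            for (SketchInsn insn : set) {
                System.out.println(insn.offset + ": " + insn.opcode);
            }
        }
    }
}
```

In this toy if/else layout the single forward goto skips over the else branch, so the collected block is the two instructions at numbers 9 and 10. Steps b) and c) would add analogous passes for if and switch instructions.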
2) Feature extraction
(1) To detect code clones on the basis of bytecode, corresponding features must be extracted from the bytecode. The features extracted by the invention are mainly the instruction sequence IS and the method call sequence MCS (as shown in FIG. 3). The instruction sequence represents the execution flow of the source code, and the method call sequence represents the method calls made in the source code. When the method call sequence is extracted, the number of parameters of each called method is saved for use in the clone detection stage.
(2) According to the classification hierarchy of bytecode instructions (as shown in FIG. 4), the bytecode instruction sequence can be normalized. The invention divides the normalization result into two sequences, NIS1 and NIS2, where NIS1 is the result of normalizing the bytecode instruction sequence to the first layer (the primary instruction sequence) and NIS2 is the result of normalizing it to the second layer (the secondary instruction sequence). For example, given the bytecode instruction sequence (aload_0, new, dup, ldc_w, invokespecial, aload_1, invokevirtual, ldc_w, invokevirtual, invokevirtual, iconst_4, invokevirtual), the normalized sequence of the first layer (primary instruction sequence) is ADEAGAGDGGAG, and the normalized sequence of the second layer (secondary instruction sequence) is A1D1E2A3G3A1G1D2G1G1A3G1.
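The two feature-extraction steps can be sketched in Java as follows. This is an illustration under stated assumptions, not the patented implementation: the input lines are assumed to follow a disassembler-style layout ("offset: opcode #index // Method owner.name:(descriptor)return"), the class and method names are hypothetical, and the classification table is a small excerpt whose codes are read off the worked example above. The opcode ldc_w is left out of the table because the example shows it with two different codes, and the first-layer code is assumed to be the leading letter of the second-layer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FeatureSketch {
    // Hypothetical excerpt of the classification table: opcode -> second-layer code.
    static final Map<String, String> SECONDARY = Map.of(
        "aload_0", "A1", "aload_1", "A1", "iconst_4", "A3", "new", "D1",
        "dup", "E2", "invokevirtual", "G1", "invokespecial", "G3");

    final List<String> is = new ArrayList<>();            // instruction sequence IS
    final List<String> mcs = new ArrayList<>();           // method call sequence MCS
    final List<Integer> paramCounts = new ArrayList<>();  // parameter count per call

    // Counts the parameters in a JVM method descriptor such as "(Ljava/lang/String;I[J)V".
    static int countParams(String descriptor) {
        int count = 0;
        int i = descriptor.indexOf('(') + 1;
        int end = descriptor.indexOf(')');
        while (i < end) {
            char c = descriptor.charAt(i);
            if (c == '[') {           // array dimension prefix; the element type follows
                i++;
                continue;
            }
            if (c == 'L') {           // object type: skip to the terminating ';'
                i = descriptor.indexOf(';', i);
            }
            count++;
            i++;
        }
        return count;
    }

    // Builds IS, MCS and the parameter counts from the assumed line layout.
    static FeatureSketch extract(List<String> lines) {
        FeatureSketch f = new FeatureSketch();
        for (String line : lines) {
            String opcode = line.trim().split("[:\\s]+")[1];
            f.is.add(opcode);
            int sep = line.indexOf(":(");
            if (opcode.startsWith("invoke") && sep >= 0) {
                f.mcs.add(line.substring(line.lastIndexOf(' ', sep) + 1, sep));
                f.paramCounts.add(countParams(line.substring(sep + 1)));
            }
        }
        return f;
    }

    // Secondary instruction sequence NIS2: concatenated second-layer codes.
    String nis2() {
        StringBuilder sb = new StringBuilder();
        for (String op : is) {
            String code = SECONDARY.get(op);
            if (code == null) {
                throw new IllegalArgumentException("unmapped opcode: " + op);
            }
            sb.append(code);
        }
        return sb.toString();
    }

    // Primary instruction sequence NIS1: leading letter of each second-layer code.
    String nis1() {
        StringBuilder sb = new StringBuilder();
        String second = nis2();
        for (int i = 0; i < second.length(); i += 2) {
            sb.append(second.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FeatureSketch f = extract(List.of(
            "0: aload_0",
            "1: invokevirtual #2 // Method java/lang/Object.hashCode:()I",
            "4: iconst_4",
            "5: invokevirtual #3 // Method java/lang/StringBuilder.append:(I)Ljava/lang/StringBuilder;"));
        System.out.println(f.nis1() + " " + f.nis2() + " " + f.mcs + " " + f.paramCounts);
    }
}
```

Keeping the parameter counts in a list parallel to MCS mirrors the statement that the number of parameters of each called method is saved for the clone detection stage.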
3) Code clone detection
In the code clone detection process, Type-3 clone detection is separated from Type-1 and Type-2 clone detection to improve detection efficiency (as shown in FIG. 5).
In the Type-1 and Type-2 clone detection stage, the secondary instruction sequences of the two code segments are first checked for equality. If they are equal, the method call sequences are compared: if the number of methods and the number of parameters of each method are the same, the two code segments are judged to be Type-1 or Type-2 clones. To speed up the comparison of secondary instruction sequences, and in view of the characteristics of Type-1 and Type-2 clones, a hash value is calculated for the secondary instruction sequence of each code segment; only if the hash values of the two secondary instruction sequences are the same are the methods in the method call sequences and their numbers of parameters compared further. Because the method calls in Type-1 and Type-2 clones have the same parameters, the consistency of the number of call parameters is taken into account when the method call sequences are compared. The hash algorithm used here is Java's built-in hash method.
In the Type-3 clone detection stage, similarity is calculated separately for the instruction sequences and for the method call sequences, and the instruction sequence similarity and the method call sequence similarity are then combined to determine a final similarity value. If the total similarity value is greater than the minimum similarity threshold, the two code segments are code clones. The instruction similarity is calculated using the edit distance and is the weighted sum of the NIS1 similarity and the NIS2 similarity. The similarity of two method call sequences is calculated recursively, by converting it into the similarity of the code segments corresponding to each method in the two method call sequences.
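The hash-then-compare check can be sketched as below. The sketch assumes that `String.hashCode()` stands in for the built-in hash method, that each method call sequence is represented by the list of parameter counts of its calls, and that the class and method names are hypothetical. The final `equals` check is an extra guard added in this sketch, since distinct strings can share a hash code; it is not described in the text.

```java
import java.util.List;

final class Type12Sketch {
    // Returns true when two code segments look like Type-1/Type-2 clones:
    // equal hashes of the secondary instruction sequences (NIS2), then the same
    // number of method calls with the same parameter count at each position.
    static boolean isType12Clone(String nis2A, List<Integer> paramCountsA,
                                 String nis2B, List<Integer> paramCountsB) {
        if (nis2A.hashCode() != nis2B.hashCode()) {
            return false; // cheap rejection: different hashes mean different sequences
        }
        if (!nis2A.equals(nis2B)) {
            return false; // extra guard against hash collisions (sketch-only addition)
        }
        // List.equals compares both the sizes and the elements in order.
        return paramCountsA.equals(paramCountsB);
    }

    public static void main(String[] args) {
        System.out.println(isType12Clone("A1G1A3G1", List.of(0, 1), "A1G1A3G1", List.of(0, 1)));
        System.out.println(isType12Clone("A1G1A3G1", List.of(0, 1), "A1G1A3G1", List.of(0, 2)));
    }
}
```

Precomputing and storing one hash per code segment lets most non-matching pairs be rejected with a single integer comparison before any sequence is examined.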
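A minimal Java sketch of the Type-3 scoring follows. The text specifies the edit distance and a weighted combination but gives no formula, so several details here are assumptions: similarity is taken as 1 minus the edit distance divided by the longer sequence length, the two weights are free parameters in [0, 1], and the method call sequence similarity is passed in as a number rather than computed recursively. All names are hypothetical.

```java
import java.util.List;

final class Type3Sketch {
    // Levenshtein edit distance over token sequences (insert, delete, substitute cost 1).
    static int editDistance(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            int[] cur = new int[b.size() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.size()];
    }

    // Assumed normalization: 1 - distance / length of the longer sequence.
    static double similarity(List<String> a, List<String> b) {
        int max = Math.max(a.size(), b.size());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    // Weighted combination: NIS1 and NIS2 similarities give the instruction
    // similarity, which is then combined with the method call similarity.
    static double totalSimilarity(double nis1Sim, double nis2Sim, double mcsSim,
                                  double wPrimary, double wInstr) {
        double instrSim = wPrimary * nis1Sim + (1 - wPrimary) * nis2Sim;
        return wInstr * instrSim + (1 - wInstr) * mcsSim;
    }

    public static void main(String[] args) {
        double s1 = similarity(List.of("A", "D", "E", "G"), List.of("A", "D", "G"));
        double s2 = similarity(List.of("A1", "D1", "E2", "G3"), List.of("A1", "D1", "G1"));
        double total = totalSimilarity(s1, s2, 1.0, 0.5, 0.5);
        double threshold = 0.7; // hypothetical minimum similarity threshold
        System.out.println(total + " clone=" + (total > threshold));
    }
}
```

Comparing token lists rather than raw strings keeps each two-character second-layer code as one unit, so a single substituted instruction costs one edit in both NIS1 and NIS2.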
The method can be used for code clone detection of the Java software system in the cloud environment, so that software developers are helped to better maintain and manage the software system.

Claims (6)

1. A multi-granularity code clone detection method based on Java bytecode in a cloud environment, characterized by comprising the following steps:
Step (1): Java source code distributed in different places in a cloud environment is obtained and compiled into class files, and each class file is converted into a txt format file through a Java command;
Step (2): the methods and code blocks in the txt format files obtained in step (1) are extracted, wherein the method granularity is formed by directly reading the bytecode instructions corresponding to each method, and the code block granularity is formed by analyzing control transfer instructions to extract code blocks;
Step (3): a classification hierarchy of bytecode instructions is constructed and encoded according to the classification of the Java bytecode instructions, wherein the hierarchy is mainly divided into three layers: the first layer is the nine major classes, except Type, of the Java bytecode instructions, the second layer is a detailed subclass division of each major class, and the third layer is the corresponding bytecode instructions;
Step (4): features are extracted from the code segments of the two granularities extracted in step (2), wherein the main extracted features comprise the instruction sequence $IS = (I_1, I_2, I_3, \ldots, I_i, \ldots, I_k)$ and the method call sequence $MCS = \{M_1, M_2, M_3, \ldots, M_i, \ldots, M_r\}$, wherein i, k and r are natural numbers;
after the features are extracted, the instruction sequence is normalized using the bytecode classification hierarchy and converted into a primary instruction sequence and a secondary instruction sequence, wherein the primary instruction sequence corresponds to the first layer (the major classes of Java bytecode instructions) of the classification hierarchy, and the secondary instruction sequence corresponds to the second layer (the subclasses of Java bytecode instructions) of the classification hierarchy;
Step (5): for the methods or the code blocks, Type-1 and Type-2 clone detection and Type-3 clone detection are carried out respectively using the instruction sequences and the method call sequences obtained in step (4).
2. The multi-granularity code clone detection method based on Java bytecode in a cloud environment according to claim 1, wherein: each txt format file in step (1) contains one or more Java methods, and each method is composed of a series of bytecode instructions and related method calls.
3. The multi-granularity code clone detection method based on Java bytecode in a cloud environment according to claim 1, wherein: in the Type-1 and Type-2 clone detection of step (5), hash values of the secondary instruction sequences of the two code segments are calculated; if the hash values are equal, the method call sequences of the two code segments are compared; and if the number of methods in the method call sequences is the same and the number of parameters of each method is also the same, the two code segments are determined to be a code clone instance.
4. The multi-granularity code clone detection method based on Java bytecode in a cloud environment according to claim 1, wherein: in the Type-3 clone detection of step (5), the primary instruction sequence similarity, the secondary instruction sequence similarity and the method call sequence similarity of the two code segments are calculated respectively; different weights are assigned to the primary instruction sequence similarity and the secondary instruction sequence similarity, and the weighted values are summed to determine the instruction sequence similarity; different weights are assigned to the instruction sequence similarity and the method call sequence similarity, and the weighted values are summed to determine the final code segment similarity value; and if the final similarity value exceeds a preset threshold, the two code segments are determined to be a code clone instance.
5. The multi-granularity code clone detection method based on Java bytecode in a cloud environment according to claim 4, wherein: the primary instruction sequence similarity and the secondary instruction sequence similarity are calculated using the edit distance.
6. The multi-granularity code clone detection method based on Java bytecode in a cloud environment according to claim 4, wherein: when the similarity of two method call sequences is calculated, the calculation is converted recursively into the calculation of the similarity of the code segments corresponding to each method in the two method call sequences.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710156441.4A CN106919403B (en) 2017-03-16 2017-03-16 multi-granularity code clone detection method based on Java byte codes in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710156441.4A CN106919403B (en) 2017-03-16 2017-03-16 multi-granularity code clone detection method based on Java byte codes in cloud environment

Publications (2)

Publication Number Publication Date
CN106919403A (en) 2017-07-04
CN106919403B (en) 2019-12-13

Family

ID=59461919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710156441.4A Active CN106919403B (en) 2017-03-16 2017-03-16 multi-granularity code clone detection method based on Java byte codes in cloud environment

Country Status (1)

Country Link
CN (1) CN106919403B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109976806B (en) * 2019-01-03 2022-06-14 杭州电子科技大学 Java statement block clone detection method based on byte code sequence matching
CN111240740B (en) * 2020-01-23 2021-09-17 复旦大学 Code clone hazard assessment method based on evolution history analysis
CN111324380A (en) * 2020-02-27 2020-06-23 复旦大学 Efficient multi-version cross-project software code clone detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103262047A (en) * 2010-12-15 2013-08-21 微软公司 Intelligent code differencing using code clone detection
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN104572471A (en) * 2015-01-28 2015-04-29 杭州电子科技大学 Index-based Java software code clone detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8868987B2 (en) * 2010-02-05 2014-10-21 Tripwire, Inc. Systems and methods for visual correlation of log events, configuration changes and conditions producing alerts in a virtual infrastructure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于索引和序列匹配的代码克隆检测技术研究";舒翔;《中国优秀硕士学位论文全文数据库 信息科技辑》;20151031(第10期);I138-130 *

Also Published As

Publication number Publication date
CN106919403A (en) 2017-07-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20191118
    Address after: 310051 Room 1201, Yinfeng Building, 1505 Binsheng Road, Xixing Street, Binjiang District, Hangzhou City, Zhejiang Province
    Applicant after: Hangzhou Lujie Technology Co., Ltd.
    Address before: 310018 Hangzhou economic and Technological Development Zone of Zhejiang province and 1215 rooms of Dacheng 1
    Applicant before: Hangzhou Fang Cheng Mdt InfoTech Ltd
GR01 Patent grant