CN109976806B - Java statement block clone detection method based on byte code sequence matching - Google Patents

Info

Publication number
CN109976806B
Authority
CN
China
Prior art keywords
scs
cell
instruction
bam
byte code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910003382.6A
Other languages
Chinese (zh)
Other versions
CN109976806A (en)
Inventor
俞东进
杨加柞
孙笑笑
陈信
王琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-01-03
Filing date
2019-01-03
Publication date
2022-06-14
Application filed by Hangzhou Dianzi University
Priority to CN201910003382.6A
Publication of CN109976806A
Application granted
Publication of CN109976806B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a Java statement block clone detection method based on bytecode sequence matching. The method extracts bytecode fragments at the statement block level by analyzing the execution and jumps of instructions, then extracts the instructions from each fragment and represents each one by a unique character according to the function it implements. Similar code fragments are finally obtained through sequence matching and similarity calculation. Extracting fragments at the statement block level and encoding each instruction as a single character improves the efficiency of code clone detection. During detection, the sequence matching method is applied to the single-character instruction sequences, and according to the invention this gives better detection and identification results than other traditional methods.

Description

Java statement block clone detection method based on byte code sequence matching
Technical Field
The invention belongs to the field of software engineering, and particularly relates to a Java statement block code clone detection technology based on byte code sequence matching.
Background
In software development, developers often reuse code by copying and pasting it, sometimes with a small number of modifications; code reused in this way is a code clone. Studies have shown that clones account for roughly 7% to 23% of the code in software systems. Code cloning reduces development cost, but it also causes harm: for example, it carries bugs in the original code fragment into the rest of the system and makes the software harder to understand and maintain. Code clone detection helps locate cloned code in a software system and thereby provides a basis for software maintenance, code management and similar tasks.
Code clone detection is an active research area in software code analysis. Within software engineering it has many applications, such as program understanding, code quality analysis, software evolution analysis and error detection, all of which need to extract cloned code fragments from a software system; code clone detection is therefore regarded as an important and valuable part of software analysis. However, most existing methods identify clones from the source code of the software system, using text-based, token-based, abstract-syntax-tree-based, metric-based or program-dependency-graph-based techniques, and their detection granularity is generally limited to the class and method level. Because similar source code is likely to be compiled into the same bytecode, identifying clones directly from bytecode can give more accurate results. Moreover, code cloning is more widespread at the statement block level than at the class and method level.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a Java statement block code clone detection method based on byte code sequence matching.
The method comprises the following specific steps:
Step (1) Compile two different Java source code files into bytecode, convert the bytecode into bytecode text files, and extract the bytecode statement blocks p and q from these text files;
Step (2) Extract the instruction sequences from statement blocks p and q, and represent each instruction by a unique character according to its function, thereby forming two single-character instruction sequences $SCS_p$ and $SCS_q$ of lengths |p| and |q|, respectively;
Step (3) For the two single-character instruction sequences $SCS_p$ and $SCS_q$, construct a bytecode matching matrix $BAM_{p,q}$ with |p|+1 rows and |q|+1 columns, and initialize the matching score of every cell in $BAM_{p,q}$ to 0;
Step (4) Starting from the first row and first column, proceeding by row and then by column, compute the matching score $BAM_{p,q}[i][j]$ of each cell (i, j) of the matrix as the maximum of the four values $BAM_{p,q}[i-1][j-1]+\sigma(SCS_p[i],SCS_q[j])$, $BAM_{p,q}[i-1][j]+\sigma_{Delete}$, $BAM_{p,q}[i][j-1]+\sigma_{Insert}$ and 0. If the i-th character of $SCS_p$ and the j-th character of $SCS_q$ are the same, then $\sigma(SCS_p[i],SCS_q[j])$ is Match, cell (i, j) is added to the starting point set of the closed backtracking path and marked as unvisited, and the predecessor cell of cell (i, j) is set by the following rule: if the matching score $BAM_{p,q}[i][j]$ equals $BAM_{p,q}[i-1][j-1]+\sigma(SCS_p[i],SCS_q[j])$, the predecessor cell is (i-1, j-1); if it equals $BAM_{p,q}[i-1][j]+\sigma_{Delete}$, the predecessor cell is (i-1, j); if it equals $BAM_{p,q}[i][j-1]+\sigma_{Insert}$, the predecessor cell is (i, j-1); if it is 0, the predecessor cell is set to null. If the i-th character of $SCS_p$ and the j-th character of $SCS_q$ are not the same, then $\sigma(SCS_p[i],SCS_q[j])$ is MisMatch. The values of Match, MisMatch, $\sigma_{Delete}$ and $\sigma_{Insert}$ are respectively 2, -2 and -2;
Step (5) From all unvisited cells in the starting point set of the closed backtracking path, select the cell with the highest matching score and, starting from that cell, follow the predecessor cells in turn until a cell whose predecessor is null is reached, thereby obtaining a closed backtracking path; mark every cell on the path as visited, and take the row coordinates and the column coordinates of the cells on the path, in order, to form two single-character instruction sequences suspected to be clones;
Step (6) Repeat step (5) to obtain all pairs of suspected-clone single-character instruction subsequences in $SCS_p$ and $SCS_q$, and form two three-dimensional vectors
[Formula image BDA0001934501270000021 not reproduced: definition of the two three-dimensional vectors]
where $p_1$ is the sum of the lengths of all single-character instruction subsequences of $SCS_p$ that form suspected clone pairs with subsequences of $SCS_q$, $L_p$ is the length of the single-character instruction sequence $SCS_p$, $q_1$ is the sum of the lengths of all single-character instruction subsequences of $SCS_q$ that form suspected clone pairs with subsequences of $SCS_p$, and $L_q$ is the length of the single-character instruction sequence $SCS_q$. The cosine similarity of the two three-dimensional vectors is taken as the cosine similarity of p and q; if it is greater than a given threshold, statement blocks p and q are considered cloned bytecode fragments and are mapped back to the source code.
The Java statement block code clone detection method based on byte code sequence matching provided by the invention comprises a group of modules, and the modules comprise: the system comprises a code preprocessing module, a feature extraction and normalization module and a code clone detection module.
The code preprocessing module extracts bytecode blocks. It first uses the Oracle JDK and a batch compiler to compile the source code files in batches and convert the result into text form, and then extracts the bytecode blocks according to the execution and jumps of the instructions in the bytecode file.
The feature extraction and normalization module extracts the instruction sequence from each bytecode block and represents each instruction by a unique character according to its function.
The code clone detection module takes as input the two single-character instruction sequences $SCS_p$ and $SCS_q$ corresponding to any two bytecode blocks p and q. It constructs the bytecode matching matrix $BAM_{p,q}$ from $SCS_p$ and $SCS_q$, computes the matching score of each cell of the matrix, constructs the closed backtracking paths, obtains the similar subsequences, computes the similarity between the two bytecode blocks, and finally obtains similar bytecode blocks according to that similarity.
The method provided by the invention extracts bytecode fragments at the statement block level by analyzing the execution and jumps of instructions, then extracts the instructions from each fragment and represents each one by a unique character according to the function it implements. Similar code fragments are finally obtained through sequence matching and similarity calculation.
The method accurately extracts the byte code segment at the statement block level by analyzing the execution and the jump of the instruction, and adopts the unique character to represent according to the function realized by the instruction, thereby improving the detection efficiency of code cloning. In the specific process of detecting clone codes, the sequence matching method is applied to a single-character instruction sequence, and compared with other traditional methods, the method has better detection and identification effects.
Drawings
FIG. 1 is a workflow diagram of code clone detection;
FIG. 2 is a diagram of an example extraction of a byte code block;
FIG. 3 is a flow diagram of byte code sequence matching.
Detailed Description
The implementation of the Java statement block code clone detection method based on bytecode sequence matching mainly comprises 3 steps (as shown in FIGS. 1 and 3):
(1) given two input Java files, the Oracle JDK and a batch compiler are used to compile the source code into bytecode files and convert them into text files, and code fragments at the statement block level are extracted from the bytecode files according to the execution and jumps of instructions; (2) the instruction sequence required by the method is extracted from each statement-block-level bytecode fragment, and each instruction in the sequence is represented by a unique character according to the function it implements, forming a single-character instruction sequence; (3) in the code clone detection stage, the single-character instruction sequences are used for sequence matching and similarity calculation between bytecode fragments, and the detected clones are mapped back to the source code.
For convenience of description, the associated symbols are defined as follows:
$IS_{BCF}$: the instruction sequence of the bytecode fragment BCF, denoted $IS_{BCF}=(I_1,I_2,\ldots,I_i,\ldots,I_n)$, where $I_i$ is the i-th (1 ≤ i ≤ n) instruction in the bytecode fragment BCF.
$SCS_{BCF}$: the single-character instruction sequence of the bytecode fragment BCF, denoted $SCS_{BCF}=(S_1,S_2,\ldots,S_i,\ldots,S_n)$, where $S_i$ (i = 1, 2, ..., n) is the i-th single character, that is, the unique character assigned to instruction $I_i$ according to the function it implements.
$SLP^i_{BCF}$: the i-th left prefix of the single-character instruction sequence of the bytecode fragment BCF. For a single-character sequence $SCS_{BCF}=(S_1,S_2,\ldots,S_i,\ldots,S_n)$, its i-th left prefix is $SLP^i_{BCF}=(S_1,S_2,\ldots,S_i)$ (1 ≤ i ≤ n).
$BAM_{p,q}$: the bytecode matching matrix. The score of the cell in row i and column j represents the matching score between the first i instructions of bytecode fragment p and the first j instructions of bytecode fragment q. Specifically, for the single-character sequences $SCS_p$ and $SCS_q$ of two bytecode fragments p and q, $BAM_{p,q}[i][j]$ represents the match between the left prefix $SLP^i_p$ of $SCS_p$ and the left prefix $SLP^j_q$ of $SCS_q$.
preCell: the predecessor cell from which the score of cell (i, j) is derived.
(1) Code preprocessing of source code files
To detect code clones on bytecode fragments at the Java statement block level, the Java source code is compiled into bytecode files with the Oracle JDK and a batch compiler and converted into text files, and the statement-block-level bytecode fragments are then extracted according to the execution and jumps of the instructions.
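One possible way to obtain the text form is sketched below in Python. It is an illustration under stated assumptions rather than the toolchain used by the invention: the source only names the Oracle JDK and a batch compiler, so the choice of the JDK tools `javac` and `javap -c`, the helper names `disassembly_commands` and `bytecode_text`, and the assumption that each file declares one top-level class without a package are all hypothetical.

```python
import subprocess
from pathlib import Path


def disassembly_commands(source_file, out_dir):
    """Build the javac and javap command lines for one source file.

    Assumption: the file declares a top-level class named after the file
    and has no package declaration, so the class name is the file stem.
    """
    class_name = Path(source_file).stem
    compile_cmd = ["javac", "-d", str(out_dir), str(source_file)]
    # -c prints the disassembled instructions, -p includes private members.
    disassemble_cmd = ["javap", "-c", "-p", "-classpath", str(out_dir), class_name]
    return compile_cmd, disassemble_cmd


def bytecode_text(source_file, out_dir):
    """Compile source_file and return its bytecode listing as text (needs a JDK)."""
    compile_cmd, disassemble_cmd = disassembly_commands(source_file, out_dir)
    subprocess.run(compile_cmd, check=True)
    result = subprocess.run(disassemble_cmd, check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Calling `bytecode_text("Demo.java", "out")` would return the listing for a hypothetical `Demo` class; batch compilation then amounts to looping over the source files.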
When extracting the bytecode blocks at the statement block level, the control transfer instructions are analyzed. If the instruction is a goto instruction, the bytecode fragment from the instruction following the goto up to the instruction at the jump line number is extracted. If the instruction is an if-type instruction, its current line number is compared with its jump line number. If the current line number is smaller than the jump line number, the bytecode fragment between the two line numbers is extracted directly. Otherwise, the bytecode fragment between the earlier instruction at the jump line number and the current line number is extracted, and the code fragment whose current line number matches the jump line number of this instruction is deleted from the corresponding fragment extracted for a previous goto instruction. If the instruction is a switch instruction, the largest number in its set of jump line numbers is taken and the bytecode fragment from the current line number to that largest number is extracted. The whole process is shown in FIG. 2.
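A simplified Python sketch of these extraction rules follows. It is not the invention's implementation: the `(offset, opcode, targets)` tuple representation and the function name `extract_blocks` are hypothetical, backward goto jumps are handled by simply ordering the two offsets (an assumption), and the rule that deletes overlapping fragments from an earlier goto block is left out.

```python
def extract_blocks(instructions):
    """Return (start, end) offset ranges of candidate statement blocks.

    instructions: list of (offset, opcode, targets) tuples in listing order,
    where targets is a list of jump offsets (empty for non-jump instructions).
    """
    offsets = [offset for offset, _, _ in instructions]
    blocks = []
    for index, (offset, opcode, targets) in enumerate(instructions):
        if not targets:
            continue
        if opcode.startswith("goto"):
            # From the instruction after the goto to the jump target.
            if index + 1 < len(offsets):
                start, end = sorted((offsets[index + 1], targets[0]))
                blocks.append((start, end))
        elif opcode.startswith("if"):
            # Forward jump: current offset to target.
            # Backward jump: target to current offset.
            start, end = sorted((offset, targets[0]))
            blocks.append((start, end))
        elif opcode in ("tableswitch", "lookupswitch"):
            # From the switch to the largest jump target.
            blocks.append((offset, max(targets)))
    return blocks


# Hand-written listing resembling a simple counting loop.
demo = [
    (0, "iconst_0", []), (1, "istore_1", []), (2, "iconst_0", []),
    (3, "istore_2", []), (4, "iload_2", []), (5, "iload_0", []),
    (6, "if_icmpge", [19]), (9, "iload_1", []), (10, "iload_2", []),
    (11, "iadd", []), (12, "istore_1", []), (13, "iinc", []),
    (16, "goto", [4]), (19, "iload_1", []), (20, "ireturn", []),
]
print(extract_blocks(demo))  # [(6, 19), (4, 19)]
```

In the demo, the conditional jump yields the loop body range and the closing goto yields the range covering the whole loop, which is the kind of overlap the deletion rule in the text is meant to resolve.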
(2) Extracting instruction features from byte code fragments
After the bytecode fragment BCF is obtained, its instruction sequence $IS_{BCF}=(I_1,I_2,\ldots,I_i,\ldots,I_n)$ is extracted and represented as the single-character instruction sequence $SCS_{BCF}=(S_1,S_2,\ldots,S_i,\ldots,S_n)$. All lowercase letters and ASCII characters are used as the single characters.
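The encoding step might look like the following Python sketch. The source does not list its actual instruction-to-character table, so the functional groups in `FUNCTION_GROUPS`, the fallback character `?` and the function name `to_single_characters` are illustrative assumptions only.

```python
# Hypothetical functional groups; the source does not give its real mapping.
FUNCTION_GROUPS = {
    "a": ("iload", "lload", "fload", "dload", "aload"),       # load a local variable
    "b": ("istore", "lstore", "fstore", "dstore", "astore"),  # store a local variable
    "c": ("iadd", "ladd", "fadd", "dadd"),                    # addition
    "d": ("if",),                                             # conditional jump
    "e": ("goto",),                                           # unconditional jump
    "f": ("invoke",),                                         # method call
}


def to_single_characters(opcodes, groups=FUNCTION_GROUPS, unknown="?"):
    """Map each opcode to the character of the first group whose prefix it matches."""
    chars = []
    for opcode in opcodes:
        for char, prefixes in groups.items():
            if opcode.startswith(prefixes):
                chars.append(char)
                break
        else:
            # Opcode not covered by the illustrative table.
            chars.append(unknown)
    return "".join(chars)


print(to_single_characters(["iload_1", "iload_2", "iadd", "istore_3"]))  # aacb
```

Because instructions with the same function share a character, variants such as `iload_1` and `iload_2` compare as equal in the later matching step.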
(3) Sequence matching and similarity calculation
First, $BAM_{p,q}$ is constructed from the single-character sequences $SCS_p$ and $SCS_q$ of the two bytecode fragments p and q and initialized to 0; the value of each cell in the matching matrix is then computed by formula 1, where $BAM_{p,q}[i][j]$ is the score of cell (i, j) in the bytecode matching matrix. If the i-th character of $SCS_p$ and the j-th character of $SCS_q$ are the same, the value of $\sigma(SCS_p[i],SCS_q[j])$ is Match, and the cell is added to the starting point set of the closed backtracking path and marked as unvisited; otherwise the value is MisMatch. Match, MisMatch, $\sigma_{Delete}$ and $\sigma_{Insert}$ respectively take the values 2, -2 and -2.
$$BAM_{p,q}[i][j]=\max\left\{BAM_{p,q}[i-1][j-1]+\sigma(SCS_p[i],SCS_q[j]),\; BAM_{p,q}[i-1][j]+\sigma_{Delete},\; BAM_{p,q}[i][j-1]+\sigma_{Insert},\; 0\right\} \quad (1)$$
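The scoring rule described above has the same shape as Smith-Waterman local alignment, and can be sketched in Python as follows. Several details are assumptions: the text gives only three values for four parameters, so this sketch takes Match = 2 and the other three as -2; a score of 0 is checked first when assigning predecessors; predecessors are recorded for every cell; and the names `fill_matrix`, `pre` and `starts` are hypothetical.

```python
MATCH = 2
MISMATCH = -2       # assumption: the source lists only "2, -2 and -2"
SIGMA_DELETE = -2
SIGMA_INSERT = -2


def fill_matrix(scs_p, scs_q):
    """Fill the matching matrix and record predecessors and start cells."""
    rows, cols = len(scs_p) + 1, len(scs_q) + 1
    bam = [[0] * cols for _ in range(rows)]
    pre = {}      # cell -> predecessor cell, or None
    starts = []   # cells whose characters match
    for i in range(1, rows):
        for j in range(1, cols):
            same = scs_p[i - 1] == scs_q[j - 1]
            diag = bam[i - 1][j - 1] + (MATCH if same else MISMATCH)
            up = bam[i - 1][j] + SIGMA_DELETE
            left = bam[i][j - 1] + SIGMA_INSERT
            score = max(diag, up, left, 0)
            bam[i][j] = score
            if score == 0:
                pre[(i, j)] = None
            elif score == diag:
                pre[(i, j)] = (i - 1, j - 1)
            elif score == up:
                pre[(i, j)] = (i - 1, j)
            else:
                pre[(i, j)] = (i, j - 1)
            if same:
                starts.append((i, j))
    return bam, pre, starts


bam, pre, starts = fill_matrix("ab", "ab")
print(bam)  # [[0, 0, 0], [0, 2, 0], [0, 0, 4]]
```

Row 0 and column 0 stay at 0 and act as the boundary, which is why the matrix has one more row and column than the sequences have characters.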
Then the predecessor cell preCell of each cell is determined by formula 2, and the highest-scoring cell among all unvisited cells in the starting point set of the closed backtracking path is selected. Starting from this cell, the predecessor cells preCell are followed until a cell with a matching score of 0 is reached; the row coordinates and the column coordinates of the cells on the closed backtracking path are then taken in order, forming two single-character instruction sequences suspected to be clones.
$$preCell(i,j)=\begin{cases}(i-1,j-1), & BAM_{p,q}[i][j]=BAM_{p,q}[i-1][j-1]+\sigma(SCS_p[i],SCS_q[j])\\ (i-1,j), & BAM_{p,q}[i][j]=BAM_{p,q}[i-1][j]+\sigma_{Delete}\\ (i,j-1), & BAM_{p,q}[i][j]=BAM_{p,q}[i][j-1]+\sigma_{Insert}\\ \text{null}, & BAM_{p,q}[i][j]=0\end{cases} \quad (2)$$
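The backtracking step can be sketched in Python as below, taking a filled matrix, a predecessor map and a start set as inputs. The function name `closed_backtracking_paths`, the dictionary form of the predecessor map and the tie-breaking by cell position are assumptions; as in the text, a path ends at a null predecessor or a score of 0, and nothing stops a later path from passing through cells that are already visited.

```python
def closed_backtracking_paths(bam, pre, starts):
    """Return one (row coordinates, column coordinates) pair per path."""
    visited = set()
    paths = []
    # Highest score first; ties broken by cell position (an assumption).
    ordered = sorted(starts, key=lambda cell: (-bam[cell[0]][cell[1]], cell))
    for start in ordered:
        if start in visited:
            continue
        path = []
        cell = start
        # Follow predecessors until a null predecessor or a zero score.
        while cell is not None and bam[cell[0]][cell[1]] > 0:
            path.append(cell)
            visited.add(cell)
            cell = pre.get(cell)
        path.reverse()
        rows = [i for i, _ in path]
        cols = [j for _, j in path]
        paths.append((rows, cols))
    return paths


# Matrix, predecessors and start set for the sequences "ab" and "ab".
bam = [[0, 0, 0], [0, 2, 0], [0, 0, 4]]
pre = {(1, 1): (0, 0), (1, 2): None, (2, 1): None, (2, 2): (1, 1)}
starts = [(1, 1), (2, 2)]
print(closed_backtracking_paths(bam, pre, starts))  # [([1, 2], [1, 2])]
```

Sorting the start set once and skipping visited cells gives the same order as repeatedly selecting the highest-scoring unvisited cell, because scores never change during backtracking. The row coordinates index characters of the first sequence and the column coordinates index characters of the second, so each pair identifies one suspected-clone subsequence pair.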
The above process is repeated to obtain all pairs of suspected-clone single-character instruction subsequences in $SCS_p$ and $SCS_q$, and two three-dimensional vectors are formed
[Formula image BDA0001934501270000053 not reproduced: definition of the two three-dimensional vectors]
where $p_1$ is the sum of the lengths of all single-character instruction subsequences of $SCS_p$ that form suspected clone pairs with subsequences of $SCS_q$, $L_p$ is the length of the single-character instruction sequence $SCS_p$, $q_1$ is the sum of the lengths of all single-character instruction subsequences of $SCS_q$ that form suspected clone pairs with subsequences of $SCS_p$, and $L_q$ is the length of the single-character instruction sequence $SCS_q$. The cosine similarity of the two three-dimensional vectors is taken as the cosine similarity of p and q; if it is greater than a given threshold, statement blocks p and q are considered cloned bytecode fragments and are mapped back to the source code.
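The final comparison can be illustrated with a generic cosine similarity in Python. The source defines the two three-dimensional vectors in a formula image that is not reproduced in this text, so the sketch takes the vectors as given inputs; the example vectors, the threshold 0.9 and the names `cosine_similarity` and `is_clone` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def is_clone(u, v, threshold):
    """Report a clone when the similarity exceeds the threshold."""
    return cosine_similarity(u, v) > threshold


# Placeholder vectors only; they are not the vectors defined by the source.
print(is_clone((3, 4, 0), (3, 4, 0), 0.9))  # True
```

The threshold controls the trade-off between missed clones and false reports; the source leaves its value open ("a certain threshold").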

Claims (1)

1. A Java statement block clone detection method based on byte code sequence matching is characterized by comprising the following specific steps:
Step 1. compiling two different Java source code files into bytecode respectively, further converting the bytecode into bytecode text files, and extracting bytecode statement blocks p and q from the bytecode text files;
Step 2. extracting instruction sequences from the statement blocks p and q respectively, and representing each instruction by a unique character according to its function, thereby forming two single-character instruction sequences $SCS_p$ and $SCS_q$ of lengths |p| and |q|, respectively;
Step 3. for the two single-character instruction sequences $SCS_p$ and $SCS_q$, constructing a bytecode matching matrix $BAM_{p,q}$ with |p|+1 rows and |q|+1 columns, and initializing the matching score of each cell in $BAM_{p,q}$ to 0;
Step 4. starting from the first row and first column, by row and then by column, calculating the matching score $BAM_{p,q}[i][j]$ of each cell (i, j) in the bytecode matching matrix $BAM_{p,q}$ as the maximum of the four values $BAM_{p,q}[i-1][j-1]+\sigma(SCS_p[i],SCS_q[j])$, $BAM_{p,q}[i-1][j]+\sigma_{Delete}$, $BAM_{p,q}[i][j-1]+\sigma_{Insert}$ and 0; wherein if the i-th character in $SCS_p$ and the j-th character in $SCS_q$ are the same, $\sigma(SCS_p[i],SCS_q[j])$ is Match, the cell (i, j) is added to the starting point set of the closed backtracking path and set as unvisited, and the predecessor cell of the cell (i, j) is then set according to the following rule: if the matching score $BAM_{p,q}[i][j]$ is $BAM_{p,q}[i-1][j-1]+\sigma(SCS_p[i],SCS_q[j])$, the predecessor cell is (i-1, j-1); if the value is $BAM_{p,q}[i-1][j]+\sigma_{Delete}$, the predecessor cell is (i-1, j); if the value is $BAM_{p,q}[i][j-1]+\sigma_{Insert}$, the predecessor cell is (i, j-1); if the value is 0, the predecessor cell is set to null; if the i-th character in $SCS_p$ and the j-th character in $SCS_q$ are not the same, $\sigma(SCS_p[i],SCS_q[j])$ is MisMatch; the values of the above Match, MisMatch, $\sigma_{Delete}$ and $\sigma_{Insert}$ are respectively 2, -2 and -2;
Step 5. selecting the cell with the highest matching score from all unvisited cells in the starting point set of the closed backtracking path, and, starting from that cell, selecting its predecessor cells in turn until a cell whose predecessor cell is null is reached, thereby obtaining a closed backtracking path; setting each cell in the closed backtracking path as visited, and selecting, respectively and in order, the row coordinate and the column coordinate of each cell in the closed backtracking path to form two single-character instruction sequences suspected to be clones;
Step 6. repeating step 5 to obtain all pairs of suspected-clone single-character instruction subsequences in $SCS_p$ and $SCS_q$, and forming two three-dimensional vectors
[Formula image FDA0003607027550000021 not reproduced: definition of the two three-dimensional vectors]
wherein $p_1$ is the sum of the lengths of all single-character instruction subsequences of $SCS_p$ that form suspected clone pairs with subsequences of $SCS_q$, $L_p$ is the length of the single-character instruction sequence $SCS_p$, $q_1$ is the sum of the lengths of all single-character instruction subsequences of $SCS_q$ that form suspected clone pairs with subsequences of $SCS_p$, and $L_q$ is the length of the single-character instruction sequence $SCS_q$; and calculating the cosine similarity of the two three-dimensional vectors, namely the cosine similarity of p and q, and, if the cosine similarity is greater than a certain threshold, considering the statement blocks p and q to be cloned bytecode fragments and mapping them back to the source code.
CN201910003382.6A 2019-01-03 2019-01-03 Java statement block clone detection method based on byte code sequence matching Active CN109976806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910003382.6A CN109976806B (en) 2019-01-03 2019-01-03 Java statement block clone detection method based on byte code sequence matching

Publications (2)

Publication Number Publication Date
CN109976806A CN109976806A (en) 2019-07-05
CN109976806B CN109976806B (en) 2022-06-14

Family

ID=67076465

Country Status (1)

Country Link
CN (1) CN109976806B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101375248A (en) * 2006-06-07 2009-02-25 香港应用科技研究院有限公司 Hardware Javatm bytecode translator
CN101739280A (en) * 2008-11-11 2010-06-16 爱思开电讯投资(中国)有限公司 System and method for optimizing byte codes for JAVA card
CN104572471A (en) * 2015-01-28 2015-04-29 杭州电子科技大学 Index-based Java software code clone detection method
CN106557350A (en) * 2015-09-30 2017-04-05 北京金山安全软件有限公司 JAVA byte code conversion method, device and equipment in application program installation package
CN106919403A (en) * 2017-03-16 2017-07-04 杭州承方信息科技有限公司 Many granularity Code Clones detection methods based on Java bytecode under cloud environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1387256B1 (en) * 2002-07-31 2018-11-21 Texas Instruments Incorporated Program counter adjustment based on the detection of an instruction prefix

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
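Algorithm for identifying duplicate tasks in workflow logs based on a relation matrix; 潘建梁, 俞东进, 陈耀旺; 《Computer Integrated Manufacturing Systems》; 20180715; pp. 1784-1792 *
Research on code clone detection based on sequence matching and bytecode; 王杰; 《China Master's Theses Full-text Database, Information Science and Technology》; 20180531; I138-68 *
Research on code clone detection technology based on indexing and sequence matching; 舒翔; 《China Master's Theses Full-text Database, Information Science and Technology》; 20151031; I138-130 *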

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant