CN109558314A - Method for clone detection in Java source code - Google Patents

Method for clone detection in Java source code

Info

Publication number
CN109558314A
Authority
CN
China
Prior art keywords
class
clone
source code
code
sequence length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811333968.0A
Other languages
Chinese (zh)
Other versions
CN109558314B (en
Inventor
张凌浩
桂盛霖
常晓青
刘元生
梁晖辉
王胜
唐超
王海
张颉
甘炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority to CN201811333968.0A
Publication of CN109558314A
Application granted
Publication of CN109558314B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G06F21/121 Restricting unauthorised execution of programs
    • G06F21/125 Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method for clone detection in Java source code, comprising the following steps: S1: extract the classes in the Java source code, using the class as the minimum extraction unit; S2: uniformly replace the function names and variable names in each Java class; S3: compare the classes in the Java source code and output the comparison results. The invention efficiently detects simple clones, that is, verbatim reuse of an entire code file or code fragment, and fragments that are identical except for whitespace and comments. It also detects clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, and outputs accurate results.

Description

Method for clone detection in Java source code
Technical field
The present invention relates to the field of computer technology, and in particular to a method for clone detection in Java source code.
Background
Two code fragments are said to be in a clone relationship when they are similar sequences. Clone classes can be defined in terms of the clone relationship: a pair of code fragments that satisfies the clone relationship is called a clone pair, and a clone class is a maximal set formed from clone pairs. The fragments in such a set may differ in their degree of cloning. The degrees of cloning, in the order the method treats them from low to high, are: identical code clones; clones that are identical except for modified comments, layout, and variable names; and clones in which program statements have been changed, added, or deleted but the text of the code remains similar.
Code clones are very common in software systems. Even systems generally recognized in industry as high quality contain them: about 19% of the code in the X Window System is cloned, and roughly 15-25% of the Linux kernel is cloned. The proportion of cloned code in ordinary software systems is higher, in some cases as high as 38%.
Detecting the cloned code contained in source code consists of three main steps: intermediate form generation, prototype clone detection, and clone report generation. Source code is written for compilers and software maintainers, and contains a large amount of information that is irrelevant to clone detection. Clone detection generally cannot operate on source code directly; it must first extract only the information relevant to clone detection and generate an intermediate form that is convenient to analyze. Prototype clone detection analyzes the intermediate form of the source program and detects program segments that have the same intermediate form. Because the results of this step are identical only in intermediate form and are not necessarily the code clones finally required, they are called prototype clones. Clone report generation analyzes and splices these prototype clones, restores them to meaningful source code clone pairs, and outputs a readable clone report. Restoring a prototype clone to a source code clone requires the correspondence between the intermediate form and the source code, so intermediate form generation must also produce a program format information table that records this correspondence. Clone detection methods are distinguished by the intermediate form they use to represent the program source code.
Text-based clone detection methods do not take structural information such as program syntax and semantics into account, so their detection accuracy is low. Methods based on abstract syntax trees incur a high cost to build the syntax tree and a further high time and space cost to match syntax trees for similarity, so they are not suitable for large system software. Methods based on program dependence graphs have even higher time and space costs than tree-based methods and are difficult to apply to large software systems. Metric-based methods can only detect at a fixed granularity, and their accuracy is low. Token-based methods are fast and independent of the development language, but have difficulty detecting complex code clones.
Summary of the invention
The technical problem to be solved by the invention is that the various clone detection methods in the prior art have poor generality and accuracy. The object of the invention is to provide a method for clone detection in Java source code that solves this problem.
The invention is achieved through the following technical solution:
A method for clone detection in Java source code, comprising the following steps: S1: extract the classes in the Java source code, using the class as the minimum extraction unit; S2: uniformly replace the function names and variable names in each Java class; S3: compare the classes in the Java source code and output the comparison results.
In the prior art, text-based methods ignore syntactic and semantic structure and have low accuracy; methods based on abstract syntax trees or program dependence graphs have time and space costs too high for large software systems; metric-based methods detect only at a fixed granularity; and token-based methods are fast and language-independent but have difficulty detecting complex code clones.
In application, the invention extracts the classes in the Java source code as the minimum extraction unit and uniformly replaces the function names and variable names in each Java class. When clone detection is performed on different code, classes can therefore still be compared accurately even if the function names and variable names in a class have been changed, and performing clone detection on the whole code class by class effectively improves accuracy. This approach is especially efficient for simple clones: verbatim reuse of an entire code file or code fragment, and fragments that are identical except for whitespace and comments. Through the above steps, the invention detects Java source code clones efficiently, at lower cost and with substantially higher efficiency than the prior art.
Further, step S2 includes the following sub-steps: scan each Java class and uniformly replace the terminal symbols that represent variable names in the class with a literal; define a section for extending or customizing token types.
In application, to shield clone detection from the effect of differing variable names in Java source programs, each Java class is scanned, and the basic terminal types that represent variable names in the class, such as [id], [number], and [stringlit], are uniformly replaced with the literal x; a custom [tokens] section is used to extend or customize token types.
Further, step S1 includes the following sub-steps: define the root nonterminal of a class and build a parse tree with that root nonterminal as the root node; each parent node in the parse tree is composed of the ordered sequence of child nodes corresponding to it; the parent nodes of the parse tree are nonterminals; the leaf nodes of the parse tree are terminals; the root nonterminal is expressed as the sequence of all leaf nodes.
In application, in order to also detect clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, the invention parses the class files and generates parse trees. The above procedure amounts to defining each nonterminal from the top down until every nonterminal is expressed as a series of terminals. In this way each Java class can be extracted from the Java source program, the start-line and end-line positions of the class in the source code are retained, and the corresponding parse tree is generated.
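The step of expressing the root nonterminal as the sequence of all leaf nodes can be sketched as a depth-first traversal. This is a minimal illustration only: the `Node` type, the `terminal_sequence` function, and the tiny example tree are hypothetical and are not taken from the patent or from TXL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A parse-tree node: nonterminals have children, terminals have none."""
    symbol: str
    children: List["Node"] = field(default_factory=list)


def terminal_sequence(node: Node) -> List[str]:
    """Return the left-to-right sequence of terminal (leaf) symbols under node."""
    if not node.children:
        return [node.symbol]
    sequence: List[str] = []
    for child in node.children:
        sequence.extend(terminal_sequence(child))
    return sequence


# Hypothetical parse tree for: class A { int x ; }
tree = Node("classes_define", [
    Node("classes_header", [Node("class"), Node("A")]),
    Node("classes_body", [
        Node("{"),
        Node("field", [Node("int"), Node("x"), Node(";")]),
        Node("}"),
    ]),
])

print(terminal_sequence(tree))
```

The resulting flat sequence is the kind of input that the later comparison steps operate on.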
Further, step S3 includes the following sub-steps: set a difference-rate threshold K; for a class C1 to be compared, denote its sequence by S1 and its sequence length by M1; for a class C2 to be compared, denote its sequence by S2 and its sequence length by M2; when M1 lies in the interval [M2, (1+K)M2], C1 and C2 are considered to possibly be in a clone relationship, and C1 and C2 are recorded as preselected clone classes.
In application, deciding whether two classes to be compared are in a clone relationship requires a parameter, the difference-rate threshold K. If the code lengths of the two classes differ too much, they cannot constitute a code clone. Therefore, for a class C1 with code sequence length M1 and a current threshold K, a class C2 with code sequence length M2 can be in a clone relationship with C1, and form a clone class with it, only when M1 lies in the interval [M2, (1+K)M2].
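The length pre-filter can be sketched as follows. The function names are hypothetical, and K is written as a fraction (0.25 for 25%). Because the interval as stated is one-sided (M1 must be at least M2), the second helper orders the pair so that the longer sequence plays the role of C1; that ordering is an assumption of this sketch, not something the text specifies.

```python
def is_candidate_pair(m1: int, m2: int, k: float) -> bool:
    """True when m1 lies in the closed interval [m2, (1 + k) * m2]."""
    return m2 <= m1 <= (1 + k) * m2


def is_candidate_unordered(len_a: int, len_b: int, k: float) -> bool:
    """Assumption of this sketch: treat the longer sequence as C1."""
    return is_candidate_pair(max(len_a, len_b), min(len_a, len_b), k)


print(is_candidate_pair(120, 100, 0.25))  # 120 is within [100, 125]
print(is_candidate_pair(130, 100, 0.25))  # 130 exceeds 125
```

Pairs that fail this check can be skipped without running the more expensive sequence comparison.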
Further, step S3 also includes the following sub-steps: when C1 and C2 are preselected clone classes, extract the longest common subsequence S of S1 and S2 and denote its length by M; obtain the length M1' of the subsequence exclusive to S1 and the length M2' of the subsequence exclusive to S2; obtain the difference rate R1 of S1 from M1 and M1', and the difference rate R2 of S2 from M2 and M2'; when R1 and R2 are both below K, C1 and C2 are determined to be clones of each other.
In application, similarity detection is performed on the sequences generated after the program source code has been converted and processed. In order to also detect clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, the invention compares two classes using their longest common subsequence. In general, the higher the proportion of a class that the longest common subsequence accounts for, the more similar the two classes are. This approach is particularly effective for detecting clones in which the order of part of the code has changed.
Further, M1' is obtained by the formula M1' = M1 - M, where M1' is the length of the subsequence exclusive to S1, M1 is the length of S1, and M is the length of the longest common subsequence of S1 and S2. M2' is obtained by the formula M2' = M2 - M, where M2' is the length of the subsequence exclusive to S2, M2 is the length of S2, and M is the length of the longest common subsequence of S1 and S2.
Further, R1 is obtained by the formula R1 = (M1'/M1) × 100%, where M1' is the length of the subsequence exclusive to S1 and M1 is the length of S1. R2 is obtained by the formula R2 = (M2'/M2) × 100%, where M2' is the length of the subsequence exclusive to S2 and M2 is the length of S2.
In application, the difference rates of the two classes under test are obtained from the above formulas, and the degree of similarity of the code is assessed from these difference rates.
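As a worked instance of the formulas above, the following sketch computes R1 and R2 as percentages from M1, M2, and M. The function name and the example numbers are illustrative assumptions only.

```python
from typing import Tuple


def difference_rates(m1: int, m2: int, m: int) -> Tuple[float, float]:
    """Return (R1, R2) in percent, given sequence lengths m1, m2 and LCS length m."""
    if m1 <= 0 or m2 <= 0 or not 0 <= m <= min(m1, m2):
        raise ValueError("need m1, m2 > 0 and 0 <= m <= min(m1, m2)")
    m1_exclusive = m1 - m  # M1' = M1 - M
    m2_exclusive = m2 - m  # M2' = M2 - M
    r1 = 100.0 * m1_exclusive / m1  # R1 = (M1'/M1) x 100%
    r2 = 100.0 * m2_exclusive / m2  # R2 = (M2'/M2) x 100%
    return r1, r2


# Example: M1 = 120, M2 = 100, M = 90 gives M1' = 30 and M2' = 10.
print(difference_rates(120, 100, 90))
```

With a threshold of K = 30%, both rates in this example (25% and 10%) fall below K, so the two classes would be judged clones of each other.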
Further, extracting the longest common subsequence S of S1 and S2 includes the following sub-steps: let S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequences; let Sk(i, j) denote the substring Sk(i)Sk(i+1)…Sk(j), where k = 1 or 2; define LCS(S1, S2) as a sequence that occurs in both S1 and S2, with its elements appearing in the same order in both; solve for LCS(S1, S2) to obtain the longest common subsequence S = LCS(S1, S2) of S1 and S2.
In application, to find the longest common subsequence S, the invention uses an LCS algorithm based on dynamic programming, which improves detection efficiency.
Further, LCS(S1, S2) is solved using an improved dynamic programming algorithm.
Further, the difference-rate threshold K is 10% to 40%.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. Through the above steps, the method of the invention detects Java source code clones efficiently, at lower cost and with substantially higher efficiency than the prior art;
2. By means of parse trees, the method of the invention detects clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, improving detection precision;
3. The method of the invention compares two classes using their longest common subsequence, which is particularly effective for detecting clones in which the order of part of the code has changed;
4. The invention efficiently detects simple clones, that is, verbatim reuse of an entire code file or code fragment, and fragments that are identical except for whitespace and comments. It also detects clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, and outputs accurate results.
Brief description of the drawings
The drawings described here provide a further understanding of the embodiments of the invention and form a part of this application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is a schematic diagram of the steps of the method of the invention;
Fig. 2 is a schematic diagram of an embodiment of the invention;
Fig. 3a is a schematic diagram of an embodiment of the invention;
Fig. 3b is a schematic diagram of an embodiment of the invention;
Fig. 4 is a schematic diagram of an embodiment of the invention;
Fig. 5 is a schematic diagram of an embodiment of the invention;
Fig. 6 is a schematic diagram of an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments and the drawings. The exemplary embodiments and their descriptions serve only to explain the invention and do not limit it.
Embodiment 1
As shown in Fig. 1, the method of the invention for clone detection in Java source code comprises the following steps: S1: extract the classes in the Java source code, using the class as the minimum extraction unit; S2: uniformly replace the function names and variable names in each Java class; S3: compare the classes in the Java source code and output the comparison results.
In this embodiment, the classes in the Java source code are extracted as the minimum extraction unit, and the function names and variable names in each Java class are uniformly replaced. When clone detection is performed on different code, classes can therefore still be compared accurately even if the function names and variable names in a class have been changed, and performing clone detection on the whole code class by class effectively improves accuracy. This approach is especially efficient for simple clones: verbatim reuse of an entire code file or code fragment, and fragments that are identical except for whitespace and comments. Through the above steps, the invention detects Java source code clones efficiently, at lower cost and with substantially higher efficiency than the prior art.
Embodiment 2
On the basis of Embodiment 1, in this embodiment step S2 includes the following sub-steps: scan each Java class and uniformly replace the terminal symbols that represent variable names in the class with a literal; define a section for extending or customizing token types.
In this embodiment, to shield clone detection from the effect of differing variable names in Java source programs, each Java class is scanned, the basic terminal types that represent variable names in the class, such as [id], [number], and [stringlit], are uniformly replaced with the literal x, comment content is uniformly replaced with [comment], and a custom [tokens] section is used to extend or customize token types.
Embodiment 3
On the basis of Embodiment 1, in this embodiment step S1 includes the following sub-steps: define the root nonterminal of a class and build a parse tree with that root nonterminal as the root node; each parent node in the parse tree is composed of the ordered sequence of child nodes corresponding to it; the parent nodes of the parse tree are nonterminals; the leaf nodes of the parse tree are terminals; the root nonterminal is expressed as the sequence of all leaf nodes.
In this embodiment, in order to also detect clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, the invention parses the class files and generates parse trees. The above procedure amounts to defining each nonterminal from the top down until every nonterminal is expressed as a series of terminals. In this way each Java class can be extracted from the Java source program, the start-line and end-line positions of the class in the source code are retained, and the corresponding parse tree is generated.
Embodiment 4
On the basis of Embodiment 3, in this embodiment step S3 includes the following sub-steps: set a difference-rate threshold K; for a class C1 to be compared, denote its sequence by S1 and its sequence length by M1; for a class C2 to be compared, denote its sequence by S2 and its sequence length by M2; when M1 lies in the interval [M2, (1+K)M2], C1 and C2 are considered to possibly be in a clone relationship, and C1 and C2 are recorded as preselected clone classes.
In this embodiment, deciding whether two classes to be compared are in a clone relationship requires a parameter, the difference-rate threshold K. If the code lengths of the two classes differ too much, they cannot constitute a code clone. Therefore, for a class C1 with code sequence length M1 and a current threshold K, a class C2 with code sequence length M2 can be in a clone relationship with C1, and form a clone class with it, only when M1 lies in the interval [M2, (1+K)M2].
Embodiment 5
On the basis of Embodiment 3, in this embodiment step S3 also includes the following sub-steps: when C1 and C2 are preselected clone classes, extract the longest common subsequence S of S1 and S2 and denote its length by M; obtain the length M1' of the subsequence exclusive to S1 and the length M2' of the subsequence exclusive to S2; obtain the difference rate R1 of S1 from M1 and M1', and the difference rate R2 of S2 from M2 and M2'; when R1 and R2 are both below K, C1 and C2 are determined to be clones of each other.
In this embodiment, similarity detection is performed on the sequences generated after the program source code has been converted and processed. In order to also detect clones in Java source programs where statements have been changed, added, or deleted but the text of the code remains largely identical, the invention compares two classes using their longest common subsequence. In general, the higher the proportion of a class that the longest common subsequence accounts for, the more similar the two classes are. This approach is particularly effective for detecting clones in which the order of part of the code has changed.
Embodiment 6
On the basis of Embodiment 5, in this embodiment M1' is obtained by the formula M1' = M1 - M, where M1' is the length of the subsequence exclusive to S1, M1 is the length of S1, and M is the length of the longest common subsequence of S1 and S2. M2' is obtained by the formula M2' = M2 - M, where M2' is the length of the subsequence exclusive to S2, M2 is the length of S2, and M is the length of the longest common subsequence of S1 and S2. R1 is obtained by the formula R1 = (M1'/M1) × 100%, and R2 is obtained by the formula R2 = (M2'/M2) × 100%.
In this embodiment, the difference rates of the two classes under test are obtained from the above formulas, and the degree of similarity of the code is assessed from these difference rates.
Embodiment 7
On the basis of Embodiment 5, in this embodiment extracting the longest common subsequence S of S1 and S2 includes the following sub-steps: let S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequences; let Sk(i, j) denote the substring Sk(i)Sk(i+1)…Sk(j), where k = 1 or 2; define LCS(S1, S2) as a sequence that occurs in both S1 and S2, with its elements appearing in the same order in both; solve for LCS(S1, S2) to obtain the longest common subsequence S = LCS(S1, S2) of S1 and S2.
In this embodiment, to find the longest common subsequence S, the invention uses an LCS algorithm based on dynamic programming, which improves detection efficiency.
Embodiment 8
To further illustrate the invention, this embodiment builds on Embodiments 1 to 7 and uses the TXL language for processing.
The operating mechanism of TXL is described first, as shown in Fig. 2:
1. Parsing: in this phase the TXL program analyzes the input source program according to the defined grammar of the corresponding language and obtains the parse tree of the source program.
2. Transformation: in this phase the TXL program transforms the whole parse tree according to user-defined rewrite rules and obtains the transformed parse tree.
3. Unparsing and output: in this phase the TXL program restores the transformed parse tree to source code in text form.
The invention uses TXL to analyze the given Java lexical specification and grammar definition. The grammar is given in extended BNF form and described structurally. The procedure is divided into the following steps.
Step 1: define the Java grammar rules and set the minimum extraction unit for Java program source code:
A java.grm file is written to define and specify the grammar rules of the Java language.
The Java language is first given a grammar definition in BNF form. A Java program source file is defined as the nonterminal [program], which is the root nonterminal of the entire input. It is composed of a series of other nonterminals, [classes_declaration] and [classes_define], where [classes_declaration] denotes class declaration statements and [classes_define] denotes class definition statements. [classes_define] is further decomposed into [classes_header] and [classes_body]; declaration statements can be divided into class name, class interface, variable name initialization, and so on. In the grammar file, [classes_header] is composed of [repeat modifier], [class_name], [opt extends_clause], and [opt implements_clause]; [classes_body] is composed of [class_or_interface_body] and [interface_header]. Each nonterminal is defined from the top down in this way until every nonterminal is expressed as a series of terminals, and the parse tree is generated. The structure of the parse tree is shown in Fig. 3a and Fig. 3b.
A java-abstract-classes file is written to define the class as the minimum extraction unit of Java program source code. [classes_define] is defined as the root nonterminal of this file. [classes_define] is further decomposed into a series of [classes_header] and [classes_body]; [classes_body] is composed of [class_or_interface_body] and [interface_header]. Decomposition continues until terminals are produced.
Implementing the BNF-form programming with TXL is only one specific embodiment of the invention; unless otherwise stated, TXL may be replaced by other equivalent programming languages or languages serving a similar purpose.
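To make the goal of Step 1 concrete (one extracted unit per class, with its start and end lines retained), here is a deliberately simplified sketch that does not use TXL or a grammar at all. It locates `class Name` with a regular expression and finds the matching closing brace by counting. The function name is hypothetical, and the approach is an assumption of this sketch: it ignores braces and the word `class` inside strings or comments, which a grammar-based extraction would handle correctly.

```python
import re
from typing import List, Tuple

# Simplification: any occurrence of "class <Name>" is treated as a class header.
CLASS_RE = re.compile(r"\bclass\s+([A-Za-z_$][\w$]*)")


def extract_classes(source: str) -> List[Tuple[str, int, int]]:
    """Return (class_name, start_line, end_line) for each class found."""
    results = []
    for match in CLASS_RE.finditer(source):
        open_pos = source.find("{", match.end())
        if open_pos == -1:
            continue
        depth = 0
        for pos in range(open_pos, len(source)):
            ch = source[pos]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # Line numbers are 1-based: count newlines before the position.
                    start_line = source.count("\n", 0, match.start()) + 1
                    end_line = source.count("\n", 0, pos) + 1
                    results.append((match.group(1), start_line, end_line))
                    break
    return results


src = "class A {\n  int x;\n}\n\nclass B {\n  void f() {\n  }\n}\n"
print(extract_classes(src))
```

The recorded line ranges are what later allows a detected clone to be reported against its position in the original source file.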
Step 2: uniformly replace the function names and variable names in each Java class:
To shield clone detection from the effect of differing variable names in Java source programs, a java-rename-blind-classes.txl file is written to perform the normalization. This file defines the replacement rules for function names: each Java class is scanned, the basic terminal types that represent variable names in the class, such as [id], [number], and [stringlit], are uniformly replaced with the literal x, and comment content is uniformly replaced with [comment]; a custom [tokens] section is used to extend or customize token types. The token definition structure is shown in Fig. 3a and Fig. 3b.
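The effect of this normalization can be illustrated with a small regex-based tokenizer. This is a sketch under stated assumptions, not the TXL rules themselves: the token pattern is simplified, the keyword list is deliberately partial, and the names `KEYWORDS`, `TOKEN_RE`, and `normalize` are hypothetical.

```python
import re
from typing import List

# Partial Java keyword list (assumption: sufficient for this illustration only).
KEYWORDS = {"class", "public", "private", "static", "void", "int",
            "return", "new", "if", "else", "for", "while"}

# Simplified token pattern: comments, string literals, numbers, identifiers,
# and any other single non-whitespace character.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<stringlit>"(?:\\.|[^"\\])*")
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<id>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<other>\S)
""", re.VERBOSE | re.DOTALL)


def normalize(source: str) -> List[str]:
    """Replace identifiers, numbers and strings with 'x', comments with '[comment]'."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            tokens.append("[comment]")
        elif kind in ("stringlit", "number"):
            tokens.append("x")
        elif kind == "id" and text not in KEYWORDS:
            tokens.append("x")
        else:
            tokens.append(text)  # keywords and punctuation are kept as-is
    return tokens


a = normalize("int count = 10; // total")
b = normalize("int n = 42; /* other */")
print(a)
print(a == b)
```

Two statements that differ only in names, literal values, and comment text produce the same normalized token sequence, which is what makes the later comparison insensitive to renaming.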
Step 3: set the difference-rate threshold
Deciding whether two classes to be compared are in a clone relationship requires a parameter, the difference-rate threshold K. If the code lengths of the two classes differ too much, they cannot constitute a code clone. Therefore, for a class C1 with code sequence length M1 and a current threshold K, a class C2 with code sequence length M2 can be in a clone relationship with C1, and form a clone class with it, only when M1 lies in the interval [M2, (1+K)M2].
Step 4: write the clone detection algorithm
Similarity detection is performed on the sequences generated after the program source code has been converted and processed. The invention uses the longest common subsequence (LCS) algorithm to compare the extracted Java classes pairwise and determine whether they are clones of each other.
1. For two classes C1 and C2 to be compared, let the sequences obtained after parsing C1 and C2 be Q1 and Q2, with lengths M1 and M2 respectively; extract the longest common subsequence S of Q1 and Q2 and denote its length by M;
2. Calculate the lengths M1' and M2' of the subsequences exclusive to Q1 and Q2, where M1' = M1 - M and M2' = M2 - M;
3. Calculate the difference rates of Q1 and Q2, denoted R1 and R2, where R1 = (M1'/M1) × 100% and R2 = (M2'/M2) × 100%;
4. Finally, compare R1 and R2 with the preset difference-rate threshold K; if R1 and R2 are both below K, C1 and C2 are determined to be clones of each other; otherwise C1 and C2 do not constitute a clone class.
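Steps 1 to 4 above can be combined into a single decision function. The sketch below is one possible reading, with hypothetical function names: K is given as a fraction, the LCS length comes from a plain two-row dynamic programme rather than the improved algorithm the text refers to, and "below K" is implemented as a strict comparison.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, keeping only two table rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_clone_pair(q1: Sequence[str], q2: Sequence[str], k: float) -> bool:
    """Apply steps 1-4: LCS length, exclusive lengths, difference rates, threshold."""
    m1, m2 = len(q1), len(q2)
    if m1 == 0 or m2 == 0:
        return False
    m = lcs_length(q1, q2)   # step 1
    r1 = (m1 - m) / m1       # steps 2 and 3, as fractions rather than percent
    r2 = (m2 - m) / m2
    return r1 < k and r2 < k  # step 4: both rates strictly below K


q1 = "class x { x x = x ; }".split()
q2 = "class x { x x = x ; x x ; }".split()
print(is_clone_pair(q1, q2, 0.30))
print(is_clone_pair(q1, q2, 0.20))
```

In this example Q1 is entirely contained in Q2, so R1 is 0 and R2 is 3/12 = 25%; the pair is accepted at K = 30% and rejected at K = 20%.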
The dynamic-programming-based LCS algorithm is implemented as follows:
Let S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequence fragments; Sk(i, j) denotes the substring Sk(i)Sk(i+1)…Sk(j) (where k = 1 or 2). The LCS of these two sequences is then defined as follows:
Definition 1: The LCS of S1 and S2 is the longest sequence that occurs in both S1 and S2 with its elements appearing in the same order in each. Let LLCS(S1, S2) denote the length of the LCS of S1 and S2.
For example, let S1 = AACTACC and S2 = ATTACCT; then LCS(S1, S2) is ATACC and LLCS(S1, S2) is 5.
The invention solves the LCS using an improved dynamic programming algorithm. A recursive function is first created as follows:
where M(i, j) denotes the LCS length between S1(1, i) and S2(1, j). Thus M(n, m) = LLCS(S1, S2), the length of the longest clone subsequence. The implementation flow chart is shown in Figure 4.
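The recursive function itself is not reproduced in this text. As an illustration, the sketch below uses the textbook LCS recurrence, which is consistent with the definition of M(i, j): M(i, j) = 0 when i = 0 or j = 0; M(i, j) = M(i-1, j-1) + 1 when S1(i) = S2(j); otherwise M(i, j) = max(M(i-1, j), M(i, j-1)). Whether the "improved" algorithm of the invention differs from this is not stated, so the code should be read as an assumption. The function names are hypothetical.

```python
def lcs_table(s1, s2):
    """Fill the table M, where M[i][j] is the LCS length of s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs(s1, s2):
    """Recover one longest common subsequence by walking the table backwards."""
    table = lcs_table(s1, s2)
    i, j = len(s1), len(s2)
    result = []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            result.append(s1[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result
```

The inputs can be strings or lists of tokens. The table costs O(n·m) time and space, which is one reason a cheap length filter before this step is worthwhile.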
Before the code clone comparison is carried out, the clone class difference threshold K must be set; commonly used values are 15%, 25% and 35%. If the difference rates of the two classes computed by the LCS algorithm are P1 and P2 respectively, and neither P1 nor P2 is higher than K, the two classes are judged to be clones of each other; otherwise they are not clones.
Step 5: Outputting the detection results
The invention presents results in two different forms; the user can choose either or both. The first is a traditional text report of clone class information, in which each clone class shows the corresponding file names and code segment line numbers, derived from the line numbers of the source files and the potential clone annotations. The second is a visual representation: it generates an HTML page that shows the first code segment as the exemplar of each clone class. Each clone class links to a number of secondary clone report pages, which show the other members of the same class. The similarity value of each pair in the class can also be viewed. The flow chart of the embodiment is shown in Figure 1.
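The text report described above could look roughly like the sketch below. The report layout, the function name `format_text_report`, and the representation of a clone class as a list of (file name, start line, end line) tuples are all assumptions made for illustration; the source does not specify them.

```python
def format_text_report(clone_classes):
    """Render clone classes as plain text.

    clone_classes: list of clone classes, each a list of
    (file_name, start_line, end_line) tuples (hypothetical structure).
    """
    lines = []
    for index, members in enumerate(clone_classes, start=1):
        lines.append(f"Clone class {index} ({len(members)} members)")
        for file_name, start_line, end_line in members:
            lines.append(f"  {file_name}: lines {start_line}-{end_line}")
    return "\n".join(lines)
```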
The invention proposes an LCS-based clone detection method for Java program source code. The algorithm finds the longest common subsequence between different classes in Java source programs, computes the similarity of the code under test from it, and uses that similarity to judge whether two Java classes are in a clone relationship. It effectively solves the problem of detecting code clones of varying degrees, from low to high, a capability that other existing code clone detection tools lack.
Embodiment 9
This embodiment builds on embodiment 8. Current code clone detection methods fall into two main classes. The first class detects clones based on content and includes tools such as CCFinder and CloneDR. These tools use suffix trees, abstract syntax trees, program dependence graphs and similar methods to detect similarity between code contents; they can detect code fragment reuse within one or several projects and have relatively high precision and recall. However, their algorithms are complex and hard to implement, some consume excessive computing resources, and their scalability is limited, so they are not suitable for clone detection over large-scale code data. These tools also tend to detect clones at finer granularity, such as code fragments, and are better suited to clone detection within a single project. The second class detects code clones with keyword search, file name matching and other methods that are not based on the textual content of the code, and includes tools such as Ichi Tracker. Such methods can handle larger data sets, but their precision and recall are clearly lower than those of content-based tools such as CCFinder. The invention, by contrast, adds more flexible and constrained code normalization and filtering for processing Java program files, and provides the post-normalization inexact matching needed to find many non-structural near-miss clones, so that code clones of varying degrees can be detected efficiently; other tools do not have this capability. Compared with CCFinder, a relatively advanced token-based code clone detection tool, the invention can detect clones that CCFinder cannot: code that is identical apart from identifier names and types, code that is identical apart from some insertions, deletions and replacements, and code that is semantically and functionally identical.
This is illustrated with the following two Java classes, shown in Figure 5 and Figure 6. The example in Figure 6 renames the variables of the class in Figure 5 and adds some comment statements. In addition, the declaration statement int a = 0 is added, and the declaration statement int b = 0 is swapped with this.frame = frame. CCFinder cannot detect a code clone of this type, whereas the invention, with an appropriate threshold set, can detect that the two classes are in a clone relationship.
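The renaming and comment changes in this example are absorbed by normalization before comparison, while the inserted and swapped statements are tolerated by the LCS threshold. The sketch below covers only the normalization part and is an assumption-laden illustration: the regular-expression tokenizer, the small keyword set, and the placeholder `id` are hypothetical, and string literals are not handled.

```python
import re

# Small subset of Java keywords, enough for this illustration (assumption).
JAVA_KEYWORDS = {"class", "public", "private", "static", "int", "void",
                 "this", "new", "return"}

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def strip_comments(source):
    """Remove /* ... */ and // comments (string literals are ignored for brevity)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", source)


def normalize(source):
    """Tokenize Java-like text and replace every identifier with the placeholder 'id'."""
    tokens = []
    for token in TOKEN_RE.findall(strip_comments(source)):
        if token not in JAVA_KEYWORDS and re.match(r"[A-Za-z_]", token):
            tokens.append("id")
        else:
            tokens.append(token)
    return tokens
```

With this normalization, `private int b = 0; this.frame = frame;` and a copy with renamed variables and added comments yield the same token sequence, so their difference rate is zero.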
The specific embodiments described above further explain the purpose, technical solution and beneficial effects of the invention in detail. It should be understood that the foregoing is merely a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A method for Java source code clone detection, characterized by comprising the following steps:
S1: extracting classes from the Java source code, with the class as the minimum extraction unit;
S2: uniformly replacing the function names and variable names in the Java classes;
S3: comparing the classes in the Java source code and outputting the comparison result.
2. The method for Java source code clone detection according to claim 1, characterized in that step S2 comprises the following sub-steps:
scanning each Java class, and uniformly replacing the terminal symbols that represent variable names in the Java class with a literal;
defining fields for extended or user-defined types.
3. The method for Java source code clone detection according to claim 1, characterized in that step S1 comprises the following sub-steps:
defining the root non-terminal symbol of a class, and building a parse tree with the root non-terminal symbol as the root node;
each parent node in the parse tree is composed of the sequence of child nodes corresponding to that parent node;
the parent nodes of the parse tree are non-terminal symbols;
the leaf nodes of the parse tree are terminal symbols;
expressing the root non-terminal symbol as the sequence of all leaf nodes.
4. The method for Java source code clone detection according to claim 3, characterized in that step S3 comprises the following sub-steps:
setting a difference rate threshold K;
for a class C1 to be compared, denoting its sequence as S1 and its sequence length as M1; for a class C2 to be compared, denoting its sequence as S2 and its sequence length as M2;
when M2 lies within the interval [M2, (1+K)M2], considering that C1 and C2 may be in a clone relationship, and recording C1 and C2 as preselected clone classes.
5. The method for Java source code clone detection according to claim 4, characterized in that step S3 further comprises the following sub-steps:
when C1 and C2 are preselected clone classes, extracting the longest common subsequence S of S1 and S2, and denoting the length of S as M;
obtaining the length M1' of the subsequence exclusive to S1, and obtaining the length M2' of the subsequence exclusive to S2;
obtaining the difference rate R1 of S1 from M1 and M1', and obtaining the difference rate R2 of S2 from M2 and M2';
when both R1 and R2 are below K, determining that C1 and C2 are clones of each other.
6. The method for Java source code clone detection according to claim 5, characterized in that M1' is obtained according to the following formula:
M1' = M1 - M; where M1' is the length of the subsequence exclusive to S1, M1 is the sequence length of S1, and M is the length of the longest common subsequence of S1 and S2;
M2' is obtained according to the following formula:
M2' = M2 - M; where M2' is the length of the subsequence exclusive to S2, M2 is the sequence length of S2, and M is the length of the longest common subsequence of S1 and S2.
7. The method for Java source code clone detection according to claim 5, characterized in that R1 is obtained according to the following formula:
R1 = (M1'/M1) × 100%; where M1' is the length of the subsequence exclusive to S1 and M1 is the sequence length of S1;
R2 is obtained according to the following formula:
R2 = (M2'/M2) × 100%; where M2' is the length of the subsequence exclusive to S2 and M2 is the sequence length of S2.
8. The method for Java source code clone detection according to claim 5, characterized in that extracting the longest common subsequence S of S1 and S2 comprises the following sub-steps:
letting S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequence fragments, where Sk(i, j) denotes the substring Sk(i)Sk(i+1)…Sk(j), with k = 1 or 2;
defining LCS(S1, S2) as a sequence that exists in both S1 and S2 and whose elements appear in the same order in S1 and S2;
solving LCS(S1, S2) to obtain the longest common subsequence S = LCS(S1, S2) of S1 and S2.
9. The method for Java source code clone detection according to claim 8, characterized in that LCS(S1, S2) is solved using an improved dynamic programming algorithm.
10. The method for Java source code clone detection according to claim 4, characterized in that the difference rate threshold K is 10% to 40%.
CN201811333968.0A 2018-11-09 2018-11-09 Java source code clone detection oriented method Active CN109558314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811333968.0A CN109558314B (en) 2018-11-09 2018-11-09 Java source code clone detection oriented method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811333968.0A CN109558314B (en) 2018-11-09 2018-11-09 Java source code clone detection oriented method

Publications (2)

Publication Number Publication Date
CN109558314A true CN109558314A (en) 2019-04-02
CN109558314B CN109558314B (en) 2021-07-27

Family

ID=65865936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811333968.0A Active CN109558314B (en) 2018-11-09 2018-11-09 Java source code clone detection oriented method

Country Status (1)

Country Link
CN (1) CN109558314B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765003A (en) * 2019-09-24 2020-02-07 贝壳技术有限公司 Code detection method, device and equipment, and storage medium
CN110989991A (en) * 2019-10-25 2020-04-10 深圳开源互联网安全技术有限公司 Method and system for detecting source code clone open source software in application program
CN112199115A (en) * 2020-09-21 2021-01-08 复旦大学 Cross-Java byte code and source code line association method based on feature similarity matching

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103262047A (en) * 2010-12-15 2013-08-21 微软公司 Intelligent code differencing using code clone detection
CN103729580A (en) * 2014-01-27 2014-04-16 国家电网公司 Method and device for detecting software plagiarism
CN104077147A (en) * 2014-07-11 2014-10-01 东南大学 Software reusing method based on code clone automatic detection and timely prompting
US20160188885A1 (en) * 2014-12-26 2016-06-30 Korea University Research And Business Foundation Software vulnerability analysis method and device
CN106843840A (en) * 2016-12-23 2017-06-13 中国科学院软件研究所 A kind of version evolving annotation multiplexing method of source code based on similarity analysis
US20170185783A1 (en) * 2015-12-29 2017-06-29 Sap Se Using code similarities for improving auditing and fixing of sast-discovered code vulnerabilities

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103262047A (en) * 2010-12-15 2013-08-21 微软公司 Intelligent code differencing using code clone detection
CN103729580A (en) * 2014-01-27 2014-04-16 国家电网公司 Method and device for detecting software plagiarism
CN104077147A (en) * 2014-07-11 2014-10-01 东南大学 Software reusing method based on code clone automatic detection and timely prompting
US20160188885A1 (en) * 2014-12-26 2016-06-30 Korea University Research And Business Foundation Software vulnerability analysis method and device
US20170185783A1 (en) * 2015-12-29 2017-06-29 Sap Se Using code similarities for improving auditing and fixing of sast-discovered code vulnerabilities
CN106843840A (en) * 2016-12-23 2017-06-13 中国科学院软件研究所 A kind of version evolving annotation multiplexing method of source code based on similarity analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏小红等: "《面向管理的克隆代码研究综述》", 《计算机学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765003A (en) * 2019-09-24 2020-02-07 贝壳技术有限公司 Code detection method, device and equipment, and storage medium
CN110989991A (en) * 2019-10-25 2020-04-10 深圳开源互联网安全技术有限公司 Method and system for detecting source code clone open source software in application program
CN110989991B (en) * 2019-10-25 2023-12-01 深圳开源互联网安全技术有限公司 Method and system for detecting source code clone open source software in application program
CN112199115A (en) * 2020-09-21 2021-01-08 复旦大学 Cross-Java byte code and source code line association method based on feature similarity matching

Also Published As

Publication number Publication date
CN109558314B (en) 2021-07-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant