CN109558314A - Method for clone detection in Java source code - Google Patents
- Publication number
- CN109558314A CN201811333968.0A
- Authority
- CN
- China
- Prior art keywords
- class
- clone
- source code
- code
- sequence length
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/10—Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
- G06F21/12—Protecting executable software
- G06F21/121—Restricting unauthorised execution of programs
- G06F21/125—Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Technology Law (AREA)
- Quality & Reliability (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses a method for clone detection in Java source code, comprising the following steps: S1: extract the classes in the Java source code as the minimum extraction unit; S2: uniformly replace the function names and variable names in the Java classes; S3: compare the classes in the Java source code and output the comparison result. The invention efficiently detects simple clones, that is, verbatim reuse of an entire code file or code fragment, or code fragments that are identical apart from whitespace and comments. It also detects Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, and it outputs accurate results.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method for clone detection in Java source code.
Background art
When two code fragments are similar sequences, a clone relationship is said to exist between them. Clone classes can be defined from the clone relationship: a pair of code fragments that satisfies the clone relationship is called a clone pair, and a clone class is a maximal set formed from clone pairs. The code fragments in such a set may be cloned to different degrees. The degrees of cloning are graded as follows: identical code clones; code clones that are identical except for changes to comments, layout, variable names, and the like; and approximate code clones, in which program statements have been changed, added, or deleted but the text of the code is similar.
Code clones are very common in software systems. Even in systems the industry regards as high quality, clones are present: 19% of the X Window System consists of code clones, and roughly 15-25% of the Linux kernel does. The proportion of code clones in ordinary software systems is higher, in some cases as high as 38%.
Detecting the cloned code contained in source code consists of three main steps: intermediate-form generation, prototype clone detection, and clone report generation. Source code is written for compilers and software maintainers, and it contains a great deal of information that is irrelevant to clone detection. Clone detection generally cannot operate on the source code directly; it must extract only the information relevant to clone detection and generate an intermediate form that is convenient to analyze. Prototype clone detection analyzes the intermediate form of the source program and finds program segments that have the same intermediate form. Because the results of this process are identical only in intermediate form, and are not necessarily the code clones ultimately required, they are called prototype clones. By analyzing and splicing these prototype clones, clone report generation restores them to meaningful source code clone pairs and outputs a readable clone report. Restoring prototype clones to source code clones requires the correspondence between the intermediate form and the source code, so the intermediate-form generation process must also produce a program format information table that records this correspondence. Clone detection methods differ in which intermediate form they use to represent the program source code.
Text-based clone detection methods do not take into account structural information such as the syntax and semantics of the program, so their accuracy is low. Methods based on abstract syntax trees incur a high cost in building the tree and a further high time and space cost in similarity matching on it, which makes them unsuitable for large system software. Methods based on program dependence graphs cost even more in time and space than tree-based methods and are difficult to apply to large software systems. Metric-based methods can only detect at a fixed granularity, and their accuracy is low. Token-based methods are fast and independent of the development language, but they have difficulty detecting complex code clones.
Summary of the invention
The technical problem to be solved by the invention is that the clone detection methods of the prior art are poor in generality and accuracy. The object of the invention is to provide a method for clone detection in Java source code that solves this problem.
The present invention is achieved through the following technical solutions:
A method for clone detection in Java source code, comprising the following steps: S1: extract the classes in the Java source code as the minimum extraction unit; S2: uniformly replace the function names and variable names in the Java classes; S3: compare the classes in the Java source code and output the comparison result.
In the prior art, text-based clone detection methods do not take into account structural information such as the syntax and semantics of the program, so their accuracy is low. Methods based on abstract syntax trees incur a high cost in building the tree and a further high time and space cost in similarity matching on it, which makes them unsuitable for large system software. Methods based on program dependence graphs cost even more in time and space than tree-based methods and are difficult to apply to large software systems. Metric-based methods can only detect at a fixed granularity, and their accuracy is low. Token-based methods are fast and independent of the development language, but they have difficulty detecting complex code clones.
In application, the invention extracts the classes in the Java source code as the minimum extraction unit and uniformly replaces the function names and variable names in the Java classes, so that classes can still be compared accurately even when the function names and variable names in a class have been changed. Performing clone detection on the whole code class by class then effectively improves detection accuracy. This approach is especially efficient for simple clones, that is, verbatim reuse of an entire code file or code fragment, or code fragments that are identical apart from whitespace and comments. Through the above steps the invention achieves efficient detection of Java source code clones, at lower cost and with much higher efficiency than the prior art.
Further, step S2 comprises the following sub-steps: scan each Java class and uniformly replace the terminal symbols that represent variable names in the class with a literal; define a section for extended or user-defined types.
In application, to shield clone detection from the effect of differing variable names in Java source programs, each Java class is scanned; the basic-type terminal symbols that represent variable names in the class, such as [id], [number], and [stringlit], are uniformly replaced with the literal x; and a custom [tokens] section is defined for extended or user-defined types.
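The replacement step can be illustrated with a minimal Python sketch. It assumes a simplified regex tokenizer rather than the grammar-based scan described here; the names `JAVA_KEYWORDS`, `TOKEN_RE`, and `normalize` are hypothetical, the keyword list is deliberately incomplete, and comments are simply dropped for brevity.

```python
import re

# Hypothetical, deliberately incomplete keyword list; a real tool would
# rely on the full Java grammar instead of this set.
JAVA_KEYWORDS = {
    "abstract", "boolean", "break", "class", "else", "extends", "final",
    "for", "if", "implements", "int", "new", "private", "public",
    "return", "static", "void", "while",
}

# One named alternative per token kind; comments are tried first.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)      # line and block comments
  | (?P<stringlit>"(?:\\.|[^"\\])*")     # string literals
  | (?P<number>\d+(?:\.\d+)?)            # integer and decimal literals
  | (?P<id>[A-Za-z_$][A-Za-z_$0-9]*)     # identifiers and keywords
  | (?P<other>\S)                        # any other non-space character
    """,
    re.VERBOSE | re.DOTALL,
)


def normalize(source):
    """Return the token sequence with ids, numbers and strings replaced by 'x'."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue  # assumption of this sketch: comments are dropped
        is_name = kind == "id" and text not in JAVA_KEYWORDS
        if is_name or kind in ("stringlit", "number"):
            tokens.append("x")
        else:
            tokens.append(text)
    return tokens
```

With this sketch, `int count = 42;` and `int total = 7;` normalize to the same token sequence, so renaming a variable or changing a constant no longer affects the comparison.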
Further, step S1 comprises the following sub-steps: define the root nonterminal of a class and build a parse tree with that root nonterminal as the root node; each parent node in the parse tree is composed of the sequence of child nodes corresponding to it; the parent nodes of the parse tree are nonterminals; the leaf nodes of the parse tree are terminals; the root nonterminal is represented as the sequence of all leaf nodes.
In application, in order to also detect Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, the invention parses the class file and generates a parse tree. The method above amounts to defining each nonterminal top-down until every nonterminal is expressed as a series of terminals. In this way each Java class can be extracted from the Java source program, the positions of its first and last lines in the source code are retained, and the corresponding parse tree is generated.
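As a rough illustration of representing a class by the sequence of its leaf nodes, the following Python sketch flattens a small hand-built tree. The `Node` type, its fields, and the example symbols are hypothetical; the tree here is not produced by a real Java parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Parse tree node: nonterminals have children, terminals do not."""
    symbol: str
    children: list = field(default_factory=list)
    start_line: int = 0   # position of the class in the source, kept for reporting
    end_line: int = 0


def leaf_sequence(node):
    """Return the terminals under `node` in left-to-right order."""
    if not node.children:
        return [node.symbol]
    sequence = []
    for child in node.children:
        sequence.extend(leaf_sequence(child))
    return sequence


# Hand-built tree for `class x { }` after identifier replacement.
example = Node("[class_definition]", [
    Node("[class_header]", [Node("class"), Node("x")]),
    Node("[class_body]", [Node("{"), Node("}")]),
], start_line=1, end_line=1)
```

Calling `leaf_sequence(example)` yields the terminal sequence of the class, which is the form the later comparison steps operate on.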
Further, step S3 comprises the following sub-steps: set a difference rate threshold K; for a class C1 to be compared, denote its sequence S1 and its sequence length M1; for a class C2 to be compared, denote its sequence S2 and its sequence length M2; when M1 lies in the interval [M2, (1+K)M2], C1 and C2 are considered possible clones and are recorded as candidate clone classes.
In application, a parameter is needed to decide whether two classes to be compared are in a clone relationship; this parameter is the difference rate threshold K. If the code lengths of the two classes differ too much, they cannot constitute a code clone. Therefore, for a class C1 to be compared with code sequence length M1, and a current difference rate threshold K, a class C2 with code sequence length M2 can be in a clone relationship with C1 and form a clone class with it only when M1 lies in the interval [M2, (1+K)M2].
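The length pre-filter is a single interval test. The sketch below is one reading of it, with the hypothetical function names `may_be_clones` and `may_be_clones_unordered`; K is passed as a fraction (0.25 for 25%). The second helper, which orders the two lengths so that the longer one plays the role of M1, is an assumption and is not stated in the text.

```python
def may_be_clones(m1, m2, k):
    """Length pre-filter: True when M1 lies in [M2, (1 + K) * M2]."""
    return m2 <= m1 <= (1 + k) * m2


def may_be_clones_unordered(len_a, len_b, k):
    """Assumed convenience wrapper: treat the longer sequence as M1."""
    shorter, longer = sorted((len_a, len_b))
    return may_be_clones(longer, shorter, k)
```

For K = 25% and M2 = 100, lengths from 100 to 125 pass the filter, and only those pairs go on to the more expensive subsequence comparison.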
Further, step S3 also comprises the following sub-steps: when C1 and C2 are candidate clone classes, extract the longest common subsequence S of S1 and S2 and denote its length M; obtain the length M1' of the subsequence unique to S1 and the length M2' of the subsequence unique to S2; obtain the difference rate R1 of S1 from M1 and M1', and the difference rate R2 of S2 from M2 and M2'; when R1 and R2 are both below K, C1 and C2 are judged to be clone classes of each other.
In application, similarity detection is performed on the sequences generated after the program source code has been transformed and processed. In order to also detect Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, the invention compares two classes by means of the longest common subsequence. In general, the larger the share of the longest common subsequence in each class, the more similar the two classes are. This approach is particularly effective for detecting clones in which the order of part of the code has changed.
Further, M1' is obtained by the formula M1' = M1 - M, where M1' is the length of the subsequence unique to S1, M1 is the length of S1, and M is the length of the longest common subsequence of S1 and S2. M2' is obtained by the formula M2' = M2 - M, where M2' is the length of the subsequence unique to S2, M2 is the length of S2, and M is the length of the longest common subsequence of S1 and S2.
Further, R1 is obtained by the formula R1 = (M1' / M1) × 100%, where M1' is the length of the subsequence unique to S1 and M1 is the length of S1. R2 is obtained by the formula R2 = (M2' / M2) × 100%, where M2' is the length of the subsequence unique to S2 and M2 is the length of S2.
In application, the difference rates of the two classes under test are obtained from the formulas above, and the degree of similarity of the code is judged from these difference rates.
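The formulas above translate directly into a few lines of Python. This is only a sketch of the arithmetic; the function name `difference_rates` is hypothetical, and the rates are returned as percentages.

```python
def difference_rates(m1, m2, m):
    """Return (R1, R2) in percent from sequence lengths M1, M2 and LCS length M."""
    m1_unique = m1 - m              # M1' = M1 - M
    m2_unique = m2 - m              # M2' = M2 - M
    r1 = 100 * m1_unique / m1       # R1 = (M1' / M1) x 100%
    r2 = 100 * m2_unique / m2       # R2 = (M2' / M2) x 100%
    return r1, r2
```

As a worked example with made-up lengths, M1 = 10, M2 = 8, and M = 6 give M1' = 4 and M2' = 2, hence R1 = 40% and R2 = 25%. With a threshold of K = 30%, R1 exceeds K, so the two classes would not be judged clones.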
Further, extracting the longest common subsequence S of S1 and S2 comprises the following sub-steps: let S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequence fragments; Sk(i, j) denotes the substring Sk(i)Sk(i+1)…Sk(j), where k = 1 or 2; define LCS(S1, S2) as a sequence that is present in both S1 and S2 and whose elements appear in the same order in S1 and S2; solve for LCS(S1, S2) to obtain the longest common subsequence S = LCS(S1, S2) of S1 and S2.
In application, to find the longest common subsequence S, the invention uses an LCS algorithm based on dynamic programming, which improves detection efficiency.
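For orientation, the following Python sketch shows the textbook dynamic-programming solution for the longest common subsequence. It is not the improved variant the invention refers to, whose details are not given here; the function name `lcs` is hypothetical.

```python
def lcs(s1, s2):
    """Return one longest common subsequence of s1 and s2 as a list."""
    n, m = len(s1), len(s2)
    # table[i][j] holds the LCS length of s1[:i] and s2[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    result = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            result.append(s1[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]
```

The function works on any two sequences whose elements can be compared for equality, so it can be applied to token lists as well as to strings. When several longest common subsequences exist, it returns one of them.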
Further, LCS(S1, S2) is solved using an improved dynamic programming algorithm.
Further, the difference rate threshold K is 10% to 40%.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Through the above steps, the method of the invention for clone detection in Java source code achieves efficient detection of Java source code clones, at lower cost and with much higher efficiency than the prior art;
2. By means of the parse tree, the method detects Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, which improves detection precision;
3. The method compares two classes by means of the longest common subsequence, which is particularly effective for detecting clones in which the order of part of the code has changed;
4. The invention efficiently detects simple clones, that is, verbatim reuse of an entire code file or code fragment, or code fragments that are identical apart from whitespace and comments. It also detects Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, and it outputs accurate results.
Brief description of the drawings
The drawings described here provide a further understanding of the embodiments of the invention and form part of this application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is a schematic diagram of the steps of the method of the invention;
Fig. 2 is a schematic diagram of an embodiment of the invention;
Fig. 3a is a schematic diagram of an embodiment of the invention;
Fig. 3b is a schematic diagram of an embodiment of the invention;
Fig. 4 is a schematic diagram of an embodiment of the invention;
Fig. 5 is a schematic diagram of an embodiment of the invention;
Fig. 6 is a schematic diagram of an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings. The exemplary embodiments and their descriptions serve only to explain the invention and do not limit it.
Embodiment 1
As shown in Fig. 1, the method of the invention for clone detection in Java source code comprises the following steps: S1: extract the classes in the Java source code as the minimum extraction unit; S2: uniformly replace the function names and variable names in the Java classes; S3: compare the classes in the Java source code and output the comparison result.
In this embodiment, the classes in the Java source code are extracted as the minimum extraction unit, and the function names and variable names in the Java classes are uniformly replaced, so that classes can still be compared accurately even when the function names and variable names in a class have been changed. Performing clone detection on the whole code class by class then effectively improves detection accuracy. This approach is especially efficient for simple clones, that is, verbatim reuse of an entire code file or code fragment, or code fragments that are identical apart from whitespace and comments. Through the above steps the invention achieves efficient detection of Java source code clones, at lower cost and with much higher efficiency than the prior art.
Embodiment 2
This embodiment builds on Embodiment 1. Step S2 comprises the following sub-steps: scan each Java class and uniformly replace the terminal symbols that represent variable names in the class with a literal; define a section for extended or user-defined types.
In this embodiment, to shield clone detection from the effect of differing variable names in Java source programs, each Java class is scanned; the basic-type terminal symbols that represent variable names in the class, such as [id], [number], and [stringlit], are uniformly replaced with the literal x; comment content is uniformly replaced with [comment]; and a custom [tokens] section is defined for extended or user-defined types.
Embodiment 3
This embodiment builds on Embodiment 1. Step S1 comprises the following sub-steps: define the root nonterminal of a class and build a parse tree with that root nonterminal as the root node; each parent node in the parse tree is composed of the sequence of child nodes corresponding to it; the parent nodes of the parse tree are nonterminals; the leaf nodes of the parse tree are terminals; the root nonterminal is represented as the sequence of all leaf nodes.
In this embodiment, in order to also detect Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, the class file is parsed and a parse tree is generated. The method above amounts to defining each nonterminal top-down until every nonterminal is expressed as a series of terminals. In this way each Java class can be extracted from the Java source program, the positions of its first and last lines in the source code are retained, and the corresponding parse tree is generated.
Embodiment 4
This embodiment builds on Embodiment 3. Step S3 comprises the following sub-steps: set a difference rate threshold K; for a class C1 to be compared, denote its sequence S1 and its sequence length M1; for a class C2 to be compared, denote its sequence S2 and its sequence length M2; when M1 lies in the interval [M2, (1+K)M2], C1 and C2 are considered possible clones and are recorded as candidate clone classes.
In this embodiment, a parameter is needed to decide whether two classes to be compared are in a clone relationship; this parameter is the difference rate threshold K. If the code lengths of the two classes differ too much, they cannot constitute a code clone. Therefore, for a class C1 to be compared with code sequence length M1, and a current difference rate threshold K, a class C2 with code sequence length M2 can be in a clone relationship with C1 and form a clone class with it only when M1 lies in the interval [M2, (1+K)M2].
Embodiment 5
This embodiment builds on Embodiment 3. Step S3 also comprises the following sub-steps: when C1 and C2 are candidate clone classes, extract the longest common subsequence S of S1 and S2 and denote its length M; obtain the length M1' of the subsequence unique to S1 and the length M2' of the subsequence unique to S2; obtain the difference rate R1 of S1 from M1 and M1', and the difference rate R2 of S2 from M2 and M2'; when R1 and R2 are both below K, C1 and C2 are judged to be clone classes of each other.
In this embodiment, similarity detection is performed on the sequences generated after the program source code has been transformed and processed. In order to also detect Java source code clones in which program statements have been changed, added, or deleted but the text of the code remains largely identical, two classes are compared by means of the longest common subsequence. In general, the larger the share of the longest common subsequence in each class, the more similar the two classes are. This approach is particularly effective for detecting clones in which the order of part of the code has changed.
Embodiment 6
This embodiment builds on Embodiment 5. M1' is obtained by the formula M1' = M1 - M, where M1' is the length of the subsequence unique to S1, M1 is the length of S1, and M is the length of the longest common subsequence of S1 and S2. M2' is obtained by the formula M2' = M2 - M, where M2' is the length of the subsequence unique to S2 and M2 is the length of S2. R1 is obtained by the formula R1 = (M1' / M1) × 100%, and R2 is obtained by the formula R2 = (M2' / M2) × 100%.
In this embodiment, the difference rates of the two classes under test are obtained from the formulas above, and the degree of similarity of the code is judged from these difference rates.
Embodiment 7
This embodiment builds on Embodiment 5. Extracting the longest common subsequence S of S1 and S2 comprises the following sub-steps: let S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequence fragments; Sk(i, j) denotes the substring Sk(i)Sk(i+1)…Sk(j), where k = 1 or 2; define LCS(S1, S2) as a sequence that is present in both S1 and S2 and whose elements appear in the same order in S1 and S2; solve for LCS(S1, S2) to obtain the longest common subsequence S = LCS(S1, S2) of S1 and S2.
In this embodiment, to find the longest common subsequence S, an LCS algorithm based on dynamic programming is used, which improves detection efficiency.
Embodiment 8
To illustrate the invention further, this embodiment builds on Embodiments 1 to 7 and uses the TXL language for processing.
The operating mechanism of TXL is explained first, as shown in Fig. 2:
1. Syntactic analysis: in this stage the TXL program analyzes the input source program according to the defined grammar of the corresponding language and obtains the parse tree of the source program.
2. Parse tree transformation: in this stage the TXL program transforms the entire parse tree according to the user-defined rewrite rules and obtains the transformed parse tree.
3. Unparsing and output: in this stage the TXL program restores the transformed parse tree to source code in text form.
The invention uses TXL to analyze the given Java lexical specification and grammar definition. The grammar is given in extended BNF form as a structural description, and the work is divided into the following steps.
Step 1: define the Java grammar rules and set the minimum extraction unit of the Java program source code:
The file java.grm is written to define and standardize the grammar rules of the Java language.
First, the grammar of the Java language is defined in BNF form. A Java program source file is defined as the nonterminal [program], which is the root nonterminal of the entire input. It is composed of a series of other nonterminals, [classes_declaration] and [classes_define], where [classes_declaration] denotes a class declaration statement and [classes_define] denotes a class definition statement. [classes_define] is further decomposed into [classes_header] and [classes_body]; declaration statements can be divided into class name, class interface, variable name initialization, and so on. In the grammar file, [classes_header] consists of [repeat modifier], [class_name], [opt extends_clause], and [opt implements_clause]; [classes_body] consists of [class_or_interface_body] and [interface_header]. Each nonterminal is defined top-down in this way until every nonterminal is expressed as a series of terminals, and the parse tree is generated. The structure of the parse tree is shown in Fig. 3a and Fig. 3b.
The file java-abstract-classes is written to define the class as the minimum extraction unit of the Java program source code. [classes_define] is defined as the root nonterminal of this file. [classes_define] is further decomposed into a series of [classes_header] and [classes_body], and [classes_body] consists of [class_or_interface_body] and [interface_header]. Decomposition continues until terminals are produced.
Using TXL to implement the BNF-form definitions is only one specific embodiment of the invention; unless otherwise stated, TXL may be replaced by another equivalent programming language or one that serves a similar purpose.
Step 2: uniformly replace the function names and variable names in the Java classes:
To shield clone detection from the effect of differing variable names in Java source programs, the file java-rename-blind-classes.txl is written to carry out the normalization. The file java-rename-blind-classes.txl defines the replacement rules for function names: each Java class is scanned; the basic-type terminal symbols that represent variable names in the class, such as [id], [number], and [stringlit], are uniformly replaced with the literal x; comment content is uniformly replaced with [comment]; and a custom [tokens] section is defined for extended or user-defined types. The structure of the token definitions is shown in Fig. 3a and Fig. 3b.
Step 3: set the difference rate threshold
A parameter is needed to decide whether two classes to be compared are in a clone relationship; this parameter is the difference rate threshold. If the code lengths of the two classes differ too much, they cannot constitute a code clone. Therefore, for a class C1 to be compared with code sequence length M1, and a current difference rate threshold K, a class C2 with code sequence length M2 can be in a clone relationship with C1 and form a clone class with it only when M1 lies in the interval [M2, (1+K)M2].
Step 4: write the clone detection algorithm
Similarity detection is performed on the sequences generated after the program source code has been transformed and processed. The invention uses the longest common subsequence (LCS) algorithm to compare the extracted Java classes pairwise and decide whether they are clone classes of each other.
1, class C1 and C2 to be compared for two, after C1 and C2 parsing, if the sequence of input is respectively Q1 and note Q2, length
Degree is respectively M1 and M2, extracts the Q2 longest common subsequence S of Q1 sum, the length is M for note;
2, calculate separately the length M1 ' and M2 ' of subsequence exclusive in Q1 and Q2, value be M1 '=M1-M and M2 '=
M2-M;
3, the variance rate for calculating Q1 and Q2, is denoted as R1 and R2 respectively, meets R1=(M1 '/M1) × 100%, R2=
(M2 '/M2) × 100%;
4, finally R1, R2 are compared with preset variance rate threshold k respectively, if R1 and R2 are below K, are sentenced
Determine C1 and C2 and clone class each other, otherwise C1 and C2 does not constitute clone's class.
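Once the LCS length M is known, steps 1 to 4 reduce to a few lines of arithmetic. The sketch below takes M1, M2, M, and K as inputs. The names VarianceRateSketch, varianceRate, and isClonePair are illustrative, rates are kept as fractions rather than percentages, and "below K" is read as a strict comparison, which is an assumption.

```java
final class VarianceRateSketch {
    /** Fraction of a sequence of the given length that lies outside the LCS. */
    static double varianceRate(int length, int lcsLength) {
        if (length <= 0) {
            throw new IllegalArgumentException("sequence length must be positive");
        }
        int exclusive = length - lcsLength;      // M' = M_i - M
        return (double) exclusive / length;      // R = M' / M_i, as a fraction
    }

    /** True when both variance rates are strictly below the threshold k. */
    static boolean isClonePair(int m1, int m2, int m, double k) {
        double r1 = varianceRate(m1, m);
        double r2 = varianceRate(m2, m);
        return r1 < k && r2 < k;                 // strict "<" is an assumption
    }
}
```

For two sequences of length 7 with an LCS of length 5, both rates are 2/7, about 28.6%, so the pair is accepted at K = 35% and rejected at K = 25%.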
The LCS algorithm based on dynamic programming is implemented as follows:
Let S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequence fragments; Sk(i, j) denotes the substring Sk(i)Sk(i+1)…Sk(j) (where k = 1 or 2). The LCS of the two sequences is then defined as follows:
Definition 1: The LCS of S1 and S2 is the longest sequence that occurs in both S1 and S2, with its elements appearing in the same order in S1 and S2. Let LLCS(S1, S2) denote the length of the LCS of S1 and S2.
For example, let S1 = AACTACC and S2 = ATTACCT; then LCS(S1, S2) is ATACC and LLCS(S1, S2) is 5.
The invention solves for the LCS with an improved dynamic programming algorithm. A recursive function M is first created, where M(i, j) denotes the LCS length between S1(1, i) and S2(1, j). Thus M(n, m) = LLCS(S1, S2), the length of the longest cloned subsequence. The implementation flow chart is shown in Fig. 4.
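The recurrence itself is not reproduced in this text. As a hedged sketch, the textbook recurrence that matches the definition of M(i, j) is: M(i, j) = 0 when i = 0 or j = 0; M(i, j) = M(i-1, j-1) + 1 when S1(i) = S2(j); otherwise M(i, j) = max(M(i-1, j), M(i, j-1)). The Java below implements this standard form, not the "improved" variant, whose details are not given here. The names LcsSketch, lcsTable, lcsLength, and lcs are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LcsSketch {
    /** Builds the table where m[i][j] is the LCS length of the first i and first j elements. */
    static <T> int[][] lcsTable(List<T> s1, List<T> s2) {
        int n = s1.size();
        int mLen = s2.size();
        int[][] m = new int[n + 1][mLen + 1];    // row 0 and column 0 stay 0
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= mLen; j++) {
                if (s1.get(i - 1).equals(s2.get(j - 1))) {
                    m[i][j] = m[i - 1][j - 1] + 1;
                } else {
                    m[i][j] = Math.max(m[i - 1][j], m[i][j - 1]);
                }
            }
        }
        return m;
    }

    /** LLCS(S1, S2): the bottom-right cell of the table. */
    static <T> int lcsLength(List<T> s1, List<T> s2) {
        return lcsTable(s1, s2)[s1.size()][s2.size()];
    }

    /** Recovers one longest common subsequence by walking the table backwards. */
    static <T> List<T> lcs(List<T> s1, List<T> s2) {
        int[][] m = lcsTable(s1, s2);
        List<T> out = new ArrayList<>();
        int i = s1.size();
        int j = s2.size();
        while (i > 0 && j > 0) {
            if (s1.get(i - 1).equals(s2.get(j - 1))) {
                out.add(s1.get(i - 1));          // matched element belongs to the LCS
                i--;
                j--;
            } else if (m[i - 1][j] >= m[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        Collections.reverse(out);                // elements were collected back to front
        return out;
    }
}
```

Applied to the characters of AACTACC and ATTACCT, lcs returns A, T, A, C, C and lcsLength returns 5, in agreement with the example above. The table costs O(n·m) time and space, which is one reason a length pre-filter before the comparison is useful.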
Before code clones are compared, the clone-class difference threshold K must also be set; commonly used values are 15%, 25%, and 35%. If the difference degrees of the two classes computed by the LCS algorithm are P1 and P2 respectively, the two classes are judged to be clones of each other when neither P1 nor P2 is higher than K; otherwise they are not clones.
Step 5: Output the detection results
The invention presents results in two different representations; the user can choose either one or both. The first is a traditional text report of clone class information, in which each clone class shows the corresponding file names and code segment line numbers, derived from the line numbers of the source files and the potential clone annotations. The second is a visual representation: it generates an HTML page that shows the first code segment as the example of each clone class. Each clone class links to secondary clone report pages that show the other members of the same class, together with the similarity value of each pair in the class. The flow chart of the embodiment is shown in Fig. 1.
The invention proposes an LCS-based clone detection method for Java program source code. The algorithm finds the longest common subsequence between different classes in a Java source program, computes the similarity of the code under detection from it, and uses that similarity to judge whether two Java classes are in a clone relationship. It effectively addresses the detection of code clones of varying degrees, from low to high. This is a capability that other existing code clone detection tools do not have.
Embodiment 9
This embodiment builds on Embodiment 8. Current code clone detection methods fall into two main classes. The first class is content-based and includes tools such as CCFinder and CloneDR. These tools use suffix trees, abstract syntax trees, program dependency graphs, and similar methods to detect similarity between code contents. They can detect code fragment reuse within one or a few projects, with relatively high precision and recall. However, the algorithms are complex and hard to implement, some consume excessive computing resources, and their scalability is limited, so they are not suitable for clone detection over large-scale code data. These tools also lean toward finer-grained clones such as code fragments and are better suited to clone detection within a single project. The second class uses keyword lookup, file name matching, and other methods not based on code text content, and includes tools such as IchiTrache. Such methods can handle larger-scale data, but their precision and recall are clearly lower than those of content-based tools such as CCFinder.
The present invention adds more flexible and restrictive code normalization and filtering for processing Java program files, and provides the post-normalization inexact matching needed to discover many unstructured near-miss clones. This enables efficient detection of code clones of varying degrees, a functional characteristic that other tools do not have. Compared with CCFinder, a relatively advanced token-based code clone detection tool, the invention can detect clones that CCFinder cannot: beyond code that is identical except for identifier names and types, it also detects code that is identical apart from some insertions, deletions, and replacements, as well as code clones that are semantically and functionally identical.
Take the following two Java classes as an example. As shown in Fig. 5 and Fig. 6, the example in Fig. 6 modifies the variable names of the class in Fig. 5 and adds some comment statements. In addition, the declaration statement int a=0 is added, and the declaration statement int b=0 is swapped with this.frame=frame. CCFinder cannot detect a code clone of this type, whereas the invention, with an appropriate threshold set, detects that the two classes are in a clone relationship.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the invention in detail. It should be understood that the foregoing are merely specific embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. A method for Java source code clone detection, characterized by comprising the following steps:
S1: extracting the classes in the Java source code as the minimum extraction unit;
S2: uniformly replacing the function names and variable names in the Java classes;
S3: comparing the classes in the Java source code and outputting the comparison result.
2. The method for Java source code clone detection according to claim 1, characterized in that step S2 comprises the following sub-steps:
scanning each Java class and uniformly replacing the terminal symbols that represent variable names in the Java class with a literal;
defining a field for extended or user-defined types.
3. The method for Java source code clone detection according to claim 1, characterized in that step S1 comprises the following sub-steps:
defining a root non-terminal symbol for a class, and building a parse tree with the root non-terminal symbol as the root node;
each parent node in the parse tree is composed of the sequence of child nodes corresponding to that parent node;
the parent nodes of the parse tree are non-terminal symbols;
the leaf nodes of the parse tree are terminal symbols;
expressing the root non-terminal symbol as the sequence of all leaf nodes.
4. The method for Java source code clone detection according to claim 3, characterized in that step S3 comprises the following sub-steps:
setting a variance rate threshold K;
for a class C1 to be compared, denoting its sequence as S1 and its sequence length as M1; for a class C2 to be compared, denoting its sequence as S2 and its sequence length as M2;
when M2 lies in the interval [M2, (1+K)M2], considering that C1 and C2 may be in a clone relationship, and recording C1 and C2 as a preselected clone class.
5. The method for Java source code clone detection according to claim 4, characterized in that step S3 further comprises the following sub-steps:
when C1 and C2 are a preselected clone class, extracting the longest common subsequence S of S1 and S2, and denoting the length of the longest common subsequence S as M;
obtaining the length M1' of the subsequence exclusive to S1, and obtaining the length M2' of the subsequence exclusive to S2;
obtaining the variance rate R1 of S1 from M1 and M1', and obtaining the variance rate R2 of S2 from M2 and M2';
when R1 and R2 are both below K, determining that C1 and C2 are clone classes of each other.
6. The method for Java source code clone detection according to claim 5, characterized in that M1' is obtained according to the following formula:
M1' = M1 - M, where M1' is the length of the subsequence exclusive to S1, M1 is the sequence length of S1, and M is the length of the longest common subsequence of S1 and S2;
M2' is obtained according to the following formula:
M2' = M2 - M, where M2' is the length of the subsequence exclusive to S2, M2 is the sequence length of S2, and M is the length of the longest common subsequence of S1 and S2.
7. The method for Java source code clone detection according to claim 5, characterized in that R1 is obtained according to the following formula:
R1 = (M1'/M1) × 100%, where M1' is the length of the subsequence exclusive to S1 and M1 is the sequence length of S1;
R2 is obtained according to the following formula:
R2 = (M2'/M2) × 100%, where M2' is the length of the subsequence exclusive to S2 and M2 is the sequence length of S2.
8. The method for Java source code clone detection according to claim 5, characterized in that extracting the longest common subsequence S of S1 and S2 comprises the following sub-steps:
letting S1 = S1(1)S1(2)…S1(n) and S2 = S2(1)S2(2)…S2(m) be two sequence fragments, where Sk(i, j) denotes the substring Sk(i)Sk(i+1)…Sk(j) and k = 1 or 2;
defining LCS(S1, S2) as a sequence that occurs in both S1 and S2, with the same order of occurrence in S1 and S2;
solving LCS(S1, S2) to obtain the longest common subsequence S = LCS(S1, S2) of S1 and S2.
9. The method for Java source code clone detection according to claim 8, characterized in that LCS(S1, S2) is solved using an improved dynamic programming algorithm.
10. The method for Java source code clone detection according to claim 4, characterized in that the variance rate threshold K is 10% to 40%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811333968.0A CN109558314B (en) | 2018-11-09 | 2018-11-09 | Java source code clone detection oriented method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811333968.0A CN109558314B (en) | 2018-11-09 | 2018-11-09 | Java source code clone detection oriented method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109558314A (en) | 2019-04-02 |
CN109558314B (en) | 2021-07-27 |
Family
ID=65865936
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811333968.0A Active CN109558314B (en) | 2018-11-09 | 2018-11-09 | Java source code clone detection oriented method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109558314B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765003A (en) * | 2019-09-24 | 2020-02-07 | 贝壳技术有限公司 | Code detection method, device and equipment, and storage medium |
CN110989991A (en) * | 2019-10-25 | 2020-04-10 | 深圳开源互联网安全技术有限公司 | Method and system for detecting source code clone open source software in application program |
CN112199115A (en) * | 2020-09-21 | 2021-01-08 | 复旦大学 | Cross-Java byte code and source code line association method based on feature similarity matching |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103262047A (en) * | 2010-12-15 | 2013-08-21 | 微软公司 | Intelligent code differencing using code clone detection |
CN103729580A (en) * | 2014-01-27 | 2014-04-16 | 国家电网公司 | Method and device for detecting software plagiarism |
CN104077147A (en) * | 2014-07-11 | 2014-10-01 | 东南大学 | Software reusing method based on code clone automatic detection and timely prompting |
US20160188885A1 (en) * | 2014-12-26 | 2016-06-30 | Korea University Research And Business Foundation | Software vulnerability analysis method and device |
CN106843840A (en) * | 2016-12-23 | 2017-06-13 | 中国科学院软件研究所 | A kind of version evolving annotation multiplexing method of source code based on similarity analysis |
US20170185783A1 (en) * | 2015-12-29 | 2017-06-29 | Sap Se | Using code similarities for improving auditing and fixing of sast-discovered code vulnerabilities |
Non-Patent Citations (1)
Title |
---|
苏小红等: "《面向管理的克隆代码研究综述》", 《计算机学报》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765003A (en) * | 2019-09-24 | 2020-02-07 | 贝壳技术有限公司 | Code detection method, device and equipment, and storage medium |
CN110989991A (en) * | 2019-10-25 | 2020-04-10 | 深圳开源互联网安全技术有限公司 | Method and system for detecting source code clone open source software in application program |
CN110989991B (en) * | 2019-10-25 | 2023-12-01 | 深圳开源互联网安全技术有限公司 | Method and system for detecting source code clone open source software in application program |
CN112199115A (en) * | 2020-09-21 | 2021-01-08 | 复旦大学 | Cross-Java byte code and source code line association method based on feature similarity matching |
Also Published As
Publication number | Publication date |
---|---|
CN109558314B (en) | 2021-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||