CN109062792A

CN109062792A - An open source code detection method based on string matching and feature matching

Info

Publication number: CN109062792A
Application number: CN201810807404.XA
Authority: CN
Inventors: 李必信; 杨安奇; 周颖; 王璐璐; 廖力
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2018-07-21
Filing date: 2018-07-21
Publication date: 2018-12-21

Abstract

The invention proposes an open source code detection method based on string matching and feature matching. It detects open source code in mixed-source software so that developers can more easily reuse and redevelop that code. The method weighs the respective strengths and weaknesses of attribute counting and structure metrics: it first narrows the database search space through feature matching, and then, working on token strings and following the idea of string matching algorithms, finds the maximal common substrings. This further reduces matching time and memory consumption when the method is applied to open source code detection in large-scale software projects.

Description

An open source code detection method based on string matching and feature matching

Technical field

The present invention relates to a detection method, and in particular to an open source code detection method based on string matching and feature matching. It belongs to the technical field of pattern matching and data mining.

Background technique

As the volume of open source code grows, research on open source code detection techniques has grown with it. The relevant prior-art methods are as follows.

1. Text-based comparison. The program is split into character strings, and repeated string sequences are found by comparing them. The advantages are that matching is independent of the specific language, implementation is simple, and the method is flexible and adaptable. The disadvantage is that this simple line-matching approach has a huge detection space, which limits its use on large-scale systems. Moreover, once the code has been changed even slightly, the technique can no longer detect it; that is, it can only detect completely identical code.

2. Detection based on parameterized matching. The main idea is to capture the content of a program that does not change, such as operators and expressions. Its advantage is that it solves the problem of detecting duplicated code whose variable names differ. Its disadvantage is that it splits the code apart, so the detected duplicated code blocks are very small, and its space complexity is also too high.

3. Methods based on abstract syntax trees. The language is parsed, a complete abstract syntax tree is built, and duplicate subtrees are detected with a standard algorithm. Because a large software system may contain a great many subtrees, the search space can be very large, so all subtrees are stored in hash containers and comparison is confined to those hash containers. Like the text-based method, detection based on abstract syntax trees is not effective enough on duplicated code that has been slightly modified.

4. Methods that detect duplicated code by finding isomorphic subgraphs in the program dependence graph (PDG). Although this approach can detect code whose statements have been reordered, its time complexity is as high as O(n⁴), so it is still unsuitable for detection in large software.

A new scheme is therefore urgently needed to solve this technical problem.

Summary of the invention

To improve the speed of code matching in large-scale projects, the present invention adopts the idea of token sequence matching and also proposes a fast matching method based on statistical features of the code. Combining the two further reduces the time and space complexity of matching. The scheme detects duplicated code by token string matching: the program is first divided into a token stream, and duplicated code is then found by identifying repeated token sequences. Because the method uses the syntactic information of the source code, its analysis is more accurate, and the time complexity of the algorithm is relatively low, which makes it suitable for analyzing large-scale software code.

To achieve the above object, the technical solution of the present invention is as follows: an open source code detection method based on string matching and feature matching, characterized in that the method comprises the following steps: step 1: code feature extraction; step 2: open source code detection based on feature matching; step 3: open source code detection based on string matching.

As an improvement of the present invention, step 1, code feature extraction, operates as follows: obtain the source code files, analyze them, split each source code file by function, extract the file-level statistical features and the function-level structural features of each code file, and store the corresponding basic statistical features in the feature library of that code file.

As an improvement of the present invention, obtaining the source code files in step 1 includes obtaining source code files from an open source code repository.

As an improvement of the present invention, step 2, open source code detection based on feature matching, proceeds as follows:

Obtain the code file to be matched;

Search its feature library to obtain the corresponding feature information;

Screen the features by iterative calculation combined with manual comparison;

Form a feature vector from the screened features and compute its Euclidean distance to the corresponding feature vectors in the feature library;

Select the code files whose distance is less than the set threshold as the candidate set for the search.

As an improvement of the present invention, step 3, open source code detection based on string matching, proceeds as follows, with the candidate set as the search space:

Step 1: Perform lexical analysis on the source program and convert it into a token string;

Step 2: Using some syntax rules, divide the token sequence into segments with the basic program block as the unit, and create a registration database;

Step 3: Use the RKR-GST string matching algorithm to obtain the matching strings of each segment;

Step 4: Combine smaller adjacent segments into larger duplicated code segments, and judge whether the code is a false detection;

Step 5: Output the open source code finally detected.

In this scheme, the method comprises the following parts.

Attribute statistics based on code files: among the different code files in a project, those whose suffix belongs to a common programming language are selected for analysis. Lexical and syntactic analysis is performed on the language, and the relevant attributes of each file are counted, such as total number of lines, number of statements, branch statement ratio, comment ratio, number of functions, average number of statements per function, function cyclomatic complexity, and function depth. These serve as candidate feature values for the subsequent feature matching.

Unified model construction based on lexical analysis: unified model construction maps code that can be changed arbitrarily without affecting program function to a unified form, for example by mapping all variable names to a fixed value, so that code in which only variable names were modified can be detected.

Feature screening and feature matching: among the counted attributes, some have little relevance to code similarity while others are highly relevant. Useless attribute features are removed through comparative tests, a feature vector for measuring code similarity is finally formed, and the similarity of code is determined from the Euclidean distance: the larger the distance, the lower the similarity, and the smaller the distance, the higher the similarity. Note that this distance-based search is fast, with time complexity O(n), so the range of similar code is located quickly and the search space for token sequence matching is reduced.

Token sequence similarity calculation: the similarity between two program segments is the similarity between their corresponding token strings. Each string can be regarded as composed of several substrings, and the substrings that are identical in two strings are their common substrings. Their similarity can therefore be expressed as the proportion of the whole strings occupied by all common substrings. Following the idea of the RKR-GST string matching algorithm, the corresponding duplicated code is obtained and the similarity is calculated as described above.

Compared with the prior art, the beneficial effects of the present invention are as follows: 1) the accuracy of open source code detection is further improved;

2) the speed of open source code detection is greatly increased.

Description of the drawings

Fig. 1 is a schematic diagram of attribute statistics based on lexical and syntactic analysis;

Fig. 2 is a detailed data flow diagram for converting source code into a token string;

Fig. 3 is a schematic diagram of feature screening and feature matching;

Fig. 4 is a flow diagram of matching based on token sequences;

Fig. 5 is the implementation flow chart of open source code detection.

Specific embodiment

To deepen the understanding of the present invention, it is further described below with reference to the drawings and specific embodiments.

Embodiment 1: Referring to Fig. 1 to Fig. 5, an open source code detection method based on string matching and feature matching comprises the following steps: step 1: code feature extraction; step 2: open source code detection based on feature matching; step 3: open source code detection based on string matching.

Step 1, code feature extraction, operates as follows: obtain the source code files, analyze them, split each source code file by function, extract the file-level statistical features and the function-level structural features of each code file, and store the corresponding basic statistical features in the feature library of that code file. Obtaining the source code files in step 1 includes obtaining source code files from an open source code repository.

Step 2, open source code detection based on feature matching, proceeds as follows:

Obtain the code file to be matched;

Step 3, open source code detection based on string matching, proceeds as follows, with the candidate set as the search space:

Step 5: Output the open source code finally detected.

The specific embodiments of the present invention are further explained below with reference to the drawings.

The method applied by the present invention differs slightly in lexical and syntactic analysis across programming languages, but the underlying idea stays the same; mainstream languages such as Java and C are used here for explanation. As shown in Fig. 1, a Java source file is taken as input. Through lexical analysis and keyword matching, basic statistics can be counted, such as the number of classes, methods, and operators contained in the source code, the number of lines of code, the number of keywords, the branch statement ratio, and the comment ratio. Syntactic analysis is then performed to obtain structural statistics such as the cyclomatic complexity and depth of functions. This information is arranged into an attribute analysis report for the source file. The analysis program can process all open source code files and store the statistics for each file of each project in a database, which serves as the feature library for subsequent feature matching.
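The basic statistics described above can be illustrated with a small sketch. The Python below is only an approximation under stated assumptions: the function name basic_statistics, the regular expressions, and the choice to count statements by semicolons are illustrative and not taken from the invention, and string literals that contain comment markers are not handled.

```python
import re

# Assumption: C-like comments; comment markers inside string literals are ignored.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
# Assumption: these keywords stand in for "branch statements".
BRANCH_RE = re.compile(r"\b(?:if|switch|case)\b")


def basic_statistics(source: str) -> dict:
    """Collect a few rough file-level statistics from C-like source text."""
    total_lines = len(source.splitlines())
    comment_chars = sum(len(m.group(0)) for m in COMMENT_RE.finditer(source))
    code = COMMENT_RE.sub("", source)          # drop comments before counting
    statements = code.count(";")               # statements approximated by ';'
    branches = len(BRANCH_RE.findall(code))
    return {
        "total_lines": total_lines,
        "statements": statements,
        "branch_ratio": branches / statements if statements else 0.0,
        "comment_ratio": comment_chars / len(source) if source else 0.0,
    }


sample = """// add two numbers
int add(int a, int b) {
    if (a > 0) { return a + b; }
    return b;
}
"""
stats = basic_statistics(sample)
print(stats)
```

Structural features such as cyclomatic complexity and function depth need a real parser and are left out of this sketch.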

To detect copied code correctly, the detection process must be able to ignore surface differences between programs and concentrate on the more substantive parts. Source code therefore needs lexical analysis before comparison, because its lexical structure is fixed, followed by normalization, which removes attributes unrelated to program structure, such as variable names, comments, and whitespace, and keeps the features that reflect program structure information. Only then are similarity comparison results more accurate. Normalization, that is, unified model construction, replaces certain lexical elements with a defined general identifier string, while statements that directly affect statement semantics or program structure remain unchanged. For example, a C source program can be processed as follows to generate the string:

1. Identifiers (variable names, method names, and so on) only describe the data used in the program, so each identifier can be replaced with a unified special mark.

2. Numeric and character constants do not change during execution but may be changed during design. Such changes have little effect on the analysis of code logic, so constants are usually blurred during analysis: all constants are replaced with one mark or simply ignored.

3. Mapping synonyms is a common form; for example, the basic data types int, long, and short are uniformly replaced with a common tag such as type.

4. Symbols such as comments, spaces, tabs, and newlines are ignored.

5. Other keywords, operators, logic symbols, delimiters, and similar symbols that directly affect the internal logic of the program are kept unchanged.

The detailed source code preprocessing data flow described above is shown in Fig. 2.

After the token sequences and feature vectors have been extracted, matching first relies on the feature vector, that is, on multiple statistical attributes. Note that the minimum granularity of the statistical attributes here is the function level: statement-level statistics would not be enough to show that code is not original, so comparing with the function as the minimum granularity is the sensible choice. There are many statistical attributes. Some may not reflect code similarity, such as the number of classes or the number of methods, while others are extremely important for judging whether code is similar, such as cyclomatic complexity and function depth. Therefore, as shown in Fig. 3, suitable feature values are first chosen from experience to form a feature vector. The feature vectors are normalized, and the distance is calculated with the Euclidean distance formula

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x and y are the two normalized feature vectors and n is the number of features. The smaller the distance, the more similar the code; the larger the distance, the less similar. A threshold p is set. If the measured distance is greater than the threshold, the feature vector in use is not sufficient to match similar code, and the features need to be recombined. If the distance is less than the threshold, the similarity of the two pieces of code is further compared manually. If the result of manual comparison is consistent with the measured distance and the set threshold, the chosen feature vector is considered suitable; otherwise, the weights of the feature attributes are adjusted or the features are screened further, and the distance is recalculated, until the most suitable matching feature vector is found. This is in effect iteration combined with manual inspection: the results need manual comparison in order to judge the quality of the features. Alternatively, samples could be labeled manually and used as a data set for statistical learning on a binary classification problem, with the training metrics evaluated by cross-validation so that the feature parameters can be adjusted; this would also achieve feature screening. That approach carries a certain manual labeling cost, however, so iterative manual judgment is chosen here, and tests show that it also achieves good results.
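The distance-based screening can be sketched as follows. This is a minimal illustration, not the invention's implementation: min-max scaling is assumed as the normalization, and the file names, the three feature columns (for instance cyclomatic complexity, function depth, statement count), and the threshold value are all hypothetical.

```python
import math


def min_max_normalize(vectors):
    """Scale each feature column to [0, 1] across all vectors."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def candidate_set(query, library, threshold):
    """Names of library entries whose normalized distance to query is below threshold."""
    names = list(library)
    normalized = min_max_normalize([query] + [library[n] for n in names])
    q, rest = normalized[0], normalized[1:]
    return [n for n, vec in zip(names, rest) if euclidean(q, vec) < threshold]


# Hypothetical feature library: one vector per open source function or file.
library = {
    "lib_a.c": [10, 3, 40],
    "lib_b.c": [2, 1, 5],
    "lib_c.c": [9, 3, 38],
}
query = [10, 3, 39]
print(candidate_set(query, library, threshold=0.5))
```

Each query costs one pass over the library, which is the O(n) behavior the text relies on; only the surviving candidates go on to token sequence matching.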

Through the feature matching of the previous step, the range of the database search can be narrowed quickly. The method above only needs to compute distances between feature vectors, and the number of extracted feature dimensions is limited, so the time complexity of matching is O(n). However, feature-based matching can only roughly express some similar structures of the code; whether the code is really similar or identical cannot be measured accurately this way. Its role is to greatly reduce the search space for token sequence matching. It is like looking for an acquaintance in a vast crowd: we first filter out a group of people by surface characteristics such as height and weight, and then examine only those with a similar build, which speeds up the search. The specific method of token sequence matching is shown in Fig. 4, and its steps are as follows:

Step 5: Output the open source code finally detected.

First, a batch file conversion program is designed. It performs lexical analysis on all programming language files in the target folder and generates a corresponding text file for each, which stores the token string generated from that language file. Each text file records not only the path of the corresponding language file under the target folder but also the line number of each token in the original program, so that detected duplicated code can be located later.
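A minimal sketch of a tokenizer that keeps the line number of every token, as the conversion program is described as doing. The regular expression, the token kinds, and the function name tokenize are assumptions made for illustration; a production lexer for C or Java would need many more cases.

```python
import re

# Assumed token classes for a C-like language; the first alternative that
# matches wins, so comments are recognized before the '/' operator.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)                  # comments (skipped later)
  | (?P<string>"(?:\\.|[^"\\])*")                    # string literal
  | (?P<char>'(?:\\.|[^'\\])')                       # character literal
  | (?P<number>\d+(?:\.\d+)?)                        # numeric literal
  | (?P<word>[A-Za-z_]\w*)                           # keyword or identifier
  | (?P<op>\+\+|--|==|!=|<=|>=|&&|\|\||[^\s\w])      # operators and delimiters
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return (kind, text, line) for every non-comment token in source."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "comment":
            continue
        # 1-based line number of the token's first character.
        line = source.count("\n", 0, m.start()) + 1
        tokens.append((kind, m.group(0), line))
    return tokens


for token in tokenize("int x = 10; // init\nx++;\n"):
    print(token)
```

Keeping the line number next to each token is what later allows a matched token range to be mapped back to a span of the original file.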

To adapt a traditional lexical analyzer to duplicated code detection, the present invention improves it by constructing a unified description model for all token words. The normalization rules are as follows:

1. Keywords: all keywords are converted from lowercase to uppercase, for example "if" is converted to "IF";

2. Delimiters: all delimiters are left unchanged in conversion, for example "++" remains "++";

3. Constants: divided into three classes: numeric, character, and string. All numeric constants are converted to "language name_fig", for example C_fig for the C language; all character constants are converted to "language name_char"; all string constants are converted to "language name_str".

4. Variable names: all variable names are converted to the format B_X1X2, where B_ indicates that this is a variable name and never changes; X1 indicates the data type of the variable (such as integer or character); X2 indicates the category of the variable (such as ordinary variable, array, pointer, or array of pointers).

5. Function names: all function names are converted to the format H_T, where H_ indicates that this is a function name and T indicates the type of the function's return value.

Normalizing the token string makes the detection tolerant of renamed identifiers in duplicated code segments. The rules above map certain kinds of similar identifiers to the same token word, because after code is copied and pasted, the data types of some variables may need to change, for example from an integer variable to a floating-point variable or another data type. By mapping variable names whose data types are close but different to the same token word, the two pieces of code will still be found as duplicated code in the later matching and will not be missed because a variable was renamed or its type changed. Likewise, to tolerate renamed functions, all function names with similar return value types are mapped to the same token word.
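The normalization rules can be sketched as a small mapping function. Everything concrete here is an assumption for illustration: tokens are taken to be already classified as (kind, text) pairs, the type and category codes "I" and "V" stand in for X1, X2, and T (the text does not list the actual codes), and character constants are left out.

```python
def normalize(tokens, language="C", variables=None, functions=None):
    """Map classified (kind, text) tokens to normalized token words.

    variables: name -> (type_code, category_code), hypothetical X1 and X2 codes
    functions: name -> return_type_code, hypothetical T code
    """
    variables = variables or {}
    functions = functions or {}
    out = []
    for kind, text in tokens:
        if kind == "keyword":
            out.append(text.upper())               # rule 1: "if" -> "IF"
        elif kind == "op":
            out.append(text)                       # rule 2: delimiters unchanged
        elif kind == "number":
            out.append(f"{language}_fig")          # rule 3: numeric constant
        elif kind == "string":
            out.append(f"{language}_str")          # rule 3: string constant
        elif kind == "variable":
            x1, x2 = variables[text]
            out.append(f"B_{x1}{x2}")              # rule 4: B_X1X2
        elif kind == "function":
            out.append(f"H_{functions[text]}")     # rule 5: H_T
        else:
            raise ValueError(f"unknown token kind: {kind}")
    return out


# Hypothetical symbol information: "I" = integer, "V" = ordinary variable.
symbols = {name: ("I", "V") for name in ("count", "total", "n", "acc")}
returns = {"sum": "I", "add": "I"}

# if (count > 10) total = sum(count);
first = [("keyword", "if"), ("op", "("), ("variable", "count"), ("op", ">"),
         ("number", "10"), ("op", ")"), ("variable", "total"), ("op", "="),
         ("function", "sum"), ("op", "("), ("variable", "count"), ("op", ")"),
         ("op", ";")]
# if (n > 99) acc = add(n);
second = [("keyword", "if"), ("op", "("), ("variable", "n"), ("op", ">"),
          ("number", "99"), ("op", ")"), ("variable", "acc"), ("op", "="),
          ("function", "add"), ("op", "("), ("variable", "n"), ("op", ")"),
          ("op", ";")]

a = normalize(first, "C", symbols, returns)
b = normalize(second, "C", symbols, returns)
print(a)
print(a == b)
```

The two statements differ in every name and constant, yet normalize to the same token sequence, which is the renaming tolerance the paragraph above describes.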

After all source programs have been converted, each source program becomes one very long character string. This sequence is divided at function-level granularity to form the registration database. The division is needed because open source code is sometimes reused as a single module rather than as an entire file; matching and merging by segment makes open source code matching more complete. Here the source program is divided into basic modules according to simple syntactic information in the source code, and each module serves as a unit of sequence segmentation. The basis of this division is that duplicated code usually consists of one or more consecutive basic segments. With the basic block as the segmentation unit, basic code segments can be found first and then combined into larger code segments.

Next comes matching on the segmented token strings, for which the RKR-GST string matching algorithm is used. RKR-GST is the detection algorithm used by the well-known program plagiarism detection software YAP3 and the online program code copy detection service JPlag. It was proposed in 1993 by M. J. Wise of the University of Sydney, who improved the GST (Greedy String Tiling) algorithm by introducing the idea of the KR (Karp-Rabin) string matching algorithm. The string matching problem solved by RKR-GST is described as follows:

Input: text string T = "t_0 t_1 … t_{n-1}", a character string of length n; pattern string P = "p_0 p_1 … p_{m-1}", a character string of length m, where n and m are positive integers, n ≥ m, and every element of T and P belongs to a finite character set Σ.

Output: the set substring of maximal matching substrings of T and P, in which no two elements overlap or contain one another.

A maximal matching substring of T and P is defined as follows. Suppose the substring of length k starting at position i in T and the substring of length k starting at position j in P are equal element by element, and either t_{i+k} ≠ p_{j+k}, or at least one of T and P has ended at that point, or the following element is already included in another maximal matching string. Then "t_i t_{i+1} … t_{i+k-1}" (or "p_j p_{j+1} … p_{j+k-1}") is called a maximal matching substring of T and P of length k, denoted max_match(i, j, k).
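The definition of max_match(i, j, k) can be made concrete with a short sketch. The function name max_match_length and the use of position sets to represent elements already covered by other maximal matches are illustrative choices, not part of the original description.

```python
def max_match_length(T, P, i, j, marked_T=frozenset(), marked_P=frozenset()):
    """Length k of the maximal match that starts at T[i] and P[j].

    Extension stops at the first mismatch, at the end of either string, or
    at a position already covered by another maximal match.
    """
    k = 0
    while (i + k < len(T) and j + k < len(P)
           and T[i + k] == P[j + k]
           and (i + k) not in marked_T
           and (j + k) not in marked_P):
        k += 1
    return k


# "bcd" matches from T[1] and P[2]; 'x' != 'q' ends the match.
print(max_match_length("abcdxyz", "zzbcdq", 1, 2))
# With T[3] already covered by another match, the extension stops earlier.
print(max_match_length("abcdxyz", "zzbcdq", 1, 2, marked_T={3}))
```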

When applied to program code similarity detection, the RKR-GST algorithm requires a parameter called the minimum match length. It sets the smallest match permitted between the token strings obtained from the two programs being compared; matches shorter than this length are ignored. For example, it is normal for two programs to both contain the string "while", so a match that short should not be counted.

Its main steps are as follows:

Step 1: Initialize the set substring to empty and set the initial value of the search length s;

Step 2: Compute the hash values of all unmarked substrings of length s in the text string T and store them in the hash structure Hash_structure.

Step 3: Compute the hash value of each unmarked substring of length s in the pattern string P and compare it with the hash values in Hash_structure. If they are equal, extend the match: if the hash values of "t_i t_{i+1} … t_{i+s-1}" and "p_j p_{j+1} … p_{j+s-1}" are equal, compare t_{i+s} with p_{j+s}; if these are equal, continue with t_{i+s+1} and p_{j+s+1}, and so on, until the two characters being compared differ, one of them is already marked, or a string ends. Record the substring obtained by extension as a candidate maximal matching substring.

Step 4: Process all candidate maximal matching substrings obtained in Step 3 in descending order of length as follows. If the text substring and pattern substring corresponding to the current candidate are equal character by character, handle it as described below; otherwise remove it from the candidate queue and move on to the next candidate.

If the current candidate does not overlap any element of the set substring, add it to the set substring and mark it in both the text string T and the pattern string P;

If the current candidate overlaps some element of the set substring, check whether the substring remaining after the overlapping part is removed has length greater than or equal to the minimum match length s. If so, add the remaining substring to the candidate queue; otherwise discard it.

Step 5: Repeat Step 2 to Step 4 until no new candidate maximal matching substring of length greater than or equal to s can be found.

Step 6: Return the set substring.
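The steps above can be sketched in simplified form. This is not the full RKR-GST algorithm: a Python dictionary keyed by the length-s substring stands in for the Karp-Rabin rolling hash, the search length s is fixed at the minimum match length, and a candidate that overlaps an accepted match is dropped and rediscovered in the next round instead of having its remainder re-queued. The function name greedy_string_tiling is illustrative.

```python
def greedy_string_tiling(T, P, s):
    """Return non-overlapping matches (i, j, k) of length k >= s between T and P."""
    marked_T, marked_P = set(), set()
    tiles = []                                    # Step 1: empty result set
    while True:
        # Step 2: index every fully unmarked length-s substring of T.
        index = {}
        for i in range(len(T) - s + 1):
            if not any(p in marked_T for p in range(i, i + s)):
                index.setdefault(tuple(T[i:i + s]), []).append(i)

        # Step 3: look up each unmarked length-s substring of P and extend.
        candidates = []
        for j in range(len(P) - s + 1):
            if any(p in marked_P for p in range(j, j + s)):
                continue
            for i in index.get(tuple(P[j:j + s]), []):
                k = s
                while (i + k < len(T) and j + k < len(P)
                       and T[i + k] == P[j + k]
                       and i + k not in marked_T
                       and j + k not in marked_P):
                    k += 1
                candidates.append((k, i, j))

        # Step 4: accept candidates longest first, skipping overlaps.
        added = False
        for k, i, j in sorted(candidates, key=lambda c: -c[0]):
            if any(p in marked_T for p in range(i, i + k)):
                continue
            if any(p in marked_P for p in range(j, j + k)):
                continue
            marked_T.update(range(i, i + k))
            marked_P.update(range(j, j + k))
            tiles.append((i, j, k))
            added = True

        # Step 5: stop once a round finds nothing new; Step 6: return the set.
        if not added:
            return tiles


# Two blocks swapped: both are still found despite the reordering.
print(greedy_string_tiling("abcdefgh", "efghabcd", 3))
```

In practice T and P would be lists of normalized token words rather than characters; the function works on any sequence of hashable elements.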

The set substring holds all maximal matching substrings of text string T and pattern string P whose length is not less than s, together with their positions and lengths in T and P. The similarity of text string T and pattern string P can be calculated as follows:

where |substring| denotes the sum of the lengths of all maximal matching substrings in the set substring, and |T| and |P| denote the lengths of text string T and pattern string P, respectively.
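A minimal sketch of the similarity calculation, assuming the coverage ratio commonly used with Greedy String Tiling (for example in JPlag), Sim(T, P) = 2·|substring| / (|T| + |P|). Whether the invention uses exactly this form is an assumption; the function name and the (i, j, k) match representation are also illustrative.

```python
def similarity(matches, len_T, len_P):
    """Assumed form: 2 * |substring| / (|T| + |P|).

    matches: iterable of (i, j, k) triples, where k is the match length.
    """
    covered = sum(k for _, _, k in matches)   # |substring|
    total = len_T + len_P
    return 2 * covered / total if total else 0.0


# Two matches of length 4 cover both 8-element strings completely.
print(similarity([(4, 0, 4), (0, 4, 4)], 8, 8))
# One match of length 3 between strings of length 6 and 4.
print(similarity([(0, 0, 3)], 6, 4))
```

Under this form the value ranges from 0 (no match of at least the minimum length) to 1 (both strings fully covered by matches).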

The RKR-GST string matching algorithm is effective against evasion tactics such as modifying comments, changing layout, renaming identifiers, reordering code blocks, reordering order-independent statements, merging procedures, and removing function calls.

In summary, feature vectors are built from statistical features extracted from program code, and similar code is screened by Euclidean distance, which narrows the database search range. Final code matching is then performed on token strings with the RKR-GST algorithm. This raises the speed of the matching search while keeping the results accurate, and finally yields the detection results for open source code from the open source library.

The above is only an embodiment of the present invention applied to the C and Java languages. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. An open source code detection method based on string matching and feature matching, characterized in that the method comprises the following steps: step 1: code feature extraction; step 2: open source code detection based on feature matching; step 3: open source code detection based on string matching.

2. The open source code detection method based on string matching and feature matching according to claim 1, characterized in that step 1, code feature extraction, operates as follows: obtain the source code files, analyze them, split each source code file by function, extract the file-level statistical features and the function-level structural features of each code file, and store the corresponding basic statistical features in the feature library of that code file.

3. The open source code detection method based on string matching and feature matching according to claim 2, characterized in that obtaining the source code files in step 1 includes obtaining source code files from an open source code repository.

4. The open source code detection method based on string matching and feature matching according to claim 3, characterized in that step 2, open source code detection based on feature matching, proceeds as follows:

Obtain the code file to be matched;

5. The open source code detection method based on string matching and feature matching according to claim 4, characterized in that step 3, open source code detection based on string matching, proceeds as follows, with the candidate set as the search space:

Step 2: Using some syntax rules, divide the token sequence into segments with the basic program block as the unit, and create a registration database;

Step 4: Combine smaller adjacent segments into larger duplicated code segments, and judge whether the code is a false detection;

Step 5: Output the open source code finally detected.